Preface


"Classification and Data Science in the Digital Age", the 17th Conference of the In- 
ternational Federation of Classification Societies (IFCS), is held in Porto, Portugal, 
from July 19th to July 23rd 2022, locally organised by the Faculty of Economics of 
the University of Porto and the Portuguese Association for Classification and Data 
Analysis, CLAD. 


The International Federation of Classification Societies (IFCS), founded in 1985, 
is an international scientific organization with non-profit and non-political motives. 
Its purpose is to promote mutual communication, co-operation and interchange of 
views among all those interested in scientific principles, numerical methods, theory 
and practice of data science, data analysis, and classification in a broad sense and in as 
wide a range of applications as possible; to serve as an agency for the dissemination 
of scientific information related to these areas of interest; to prepare international 
conferences; to publish a newsletter and other publications. The scientific activities 
of the Federation are intended for all people interested in theory of classification 
and data analysis, and related methods and applications. IFCS 2022 — originally 
scheduled for August 2021, and postponed due to the Covid-19 pandemic — will be 
its 17th edition; previous editions were held in Thessaloniki (2019), Tokyo (2017) 
and Bologna (2015). 


Keynote lectures are delivered by Genevera Allen (Rice University, USA), Charles
Bouveyron (Université Côte d'Azur, Nice, France), Dianne Cook (Monash University,
Melbourne, Australia), and João Gama (Faculty of Economics, University of
Porto & LIAAD INESC TEC, Portugal). The conference program includes two
tutorials: "Analysis of Data Streams" by João Gama (Faculty of Economics, University
of Porto & LIAAD INESC TEC, Portugal) and "Categorical Data Analysis of
Visualization" by Rosaria Lombardo (Università degli Studi della Campania Luigi
Vanvitelli, Italy) and Eric Beh (University of Newcastle, Australia). IFCS 2022 has
highlighted topics, which lead to Semi-Plenary Invited Sessions. The Conference
program also includes Thematic Tracks on specific areas, as well as free contributed
sessions on different topics (both oral communications and posters).




The Conference Scientific Program Committee is co-chaired by Paula Brito, José G. 
Dias, Berthold Lausen, and Angela Montanari, and includes representatives of the 
IFCS member societies: Adalbert Wilhelm — GfKl, Ahmed Moussa — MCS, Arthur 
White — IPRCS, Brian Franczak — CS, Eva Boj del Val — SEIO, Fionn Murtagh — 
BCS, Francesco Mola - CLADAG, Hyunjoong Kim - KCS, Javier Trejos Zelaya — 
SoCCCAD, Koji Kurihara — JCS, Krzysztof Jajuga - SKAD, Mark de Rooij - VOC, 
Mohamed Nadif — SFC, Niel le Roux — MDAG, Simona Korenjak Cerne - SSS, 
Theodore Chadjipadelis - GSDA, who were responsible for the Conference Scien- 
tific Program, and whom the organisers wish to thank for their precious cooperation. 
Special thanks are also due to the chairs of the Thematic Tracks, for their invaluable 
collaboration. 


The papers included in this volume present new developments in relevant topics 
of Data Science and Classification, constituting a valuable collection of method- 
ological and applied papers that represent the current research in highly developing 
areas. Combining new methodological advances with a wide variety of real appli- 
cations, this volume is certainly of great value for Data Science researchers and 
practitioners alike. 


First of all, the organisers of the Conference and the editors would like to thank 
all authors, for their cooperation and commitment. We are especially grateful to all
colleagues who served as reviewers, and whose work was decisive to the scientific 
quality of these proceedings. We also thank all those who have contributed to the de- 
sign and production of this Book of Proceedings at Springer, in particular Veronika 
Rosteck, for her help concerning all aspects of publication. 


The organisers would like to express their gratitude to the Portuguese Association 
for Classification and Data Analysis, CLAD, as well as to the Faculty of Economics 
of the University of Porto (FEP-UP), who enthusiastically supported the Conference 
from the very start, and contributed to its success. We cordially thank all members 
of the Local Organising Committee — Adelaide Figueiredo, Carlos Ferreira, Carlos 
Marcelo, Conceição Rocha, Fernanda Figueiredo, Fernanda Sousa, Jorge Pereira, 
M. Eduarda Silva, Paulo Teles, Pedro Campos, Pedro Duarte Silva, and Sónia Dias 
— and all people at FEP-UP who worked actively for the conference organisation, 
and whose work is much appreciated. We are very grateful to all our sponsors, for 
their generous support. Finally, we thank all authors and participants, who made the 
conference possible. 


Porto, Paula Brito 
July 2022 José G. Dias 
Berthold Lausen 


Angela Montanari 
Rebecca Nugent 
A Topological Clustering of Individuals


Rafik Abdesselam 


Abstract The clustering of objects-individuals is one of the most widely used approaches
to exploring multidimensional data. The two most common unsupervised clustering
strategies, Hierarchical Ascending Clustering (HAC) and k-means partitioning, are
used to identify groups of similar objects in a dataset and to divide it into homogeneous
groups. The proposed Topological Clustering of Individuals, or TCI, studies a
homogeneous set of individual rows of a data table, based on the notion of neighborhood
graphs; the columns-variables are more-or-less correlated or linked according to
whether the variable is of a quantitative or qualitative type. It enables topological
analysis of the clustering of individuals whose variables can be quantitative, qualitative
or a mixture of the two. It first analyzes the correlations or associations observed
between the variables in a topological context of principal component analysis (PCA)
or multiple correspondence analysis (MCA), depending on the type of variable, then
classifies individuals into homogeneous groups, relative to the structure of the variables
considered. The proposed TCI method is presented and illustrated here using
a real dataset with quantitative variables, but it can also be applied with qualitative
or mixed variables.


Keywords: hierarchical clustering, proximity measure, neighborhood graph, adja- 
cency matrix, multivariate data analysis 
1 Introduction 


The objective of this article is to propose a topological method of data analysis in the 
context of clustering. The proposed approach, Topological Clustering of Individuals 
(TCI), is different from those that already exist and with which it is compared. There
are approaches specifically devoted to the clustering of individuals, for example, the 
Cluster procedure implemented in SAS software, but as far as we know, none of 
these approaches has been proposed in a topological context. 

Proximity measures play an important role in many areas of data analysis [16, 5, 9]. 
The results of any operation involving structuring, clustering or classifying objects 
are strongly dependent on the proximity measure chosen. 

This study proposes a method for the topological clustering of individuals whatever
the type of variable being considered: quantitative, qualitative or a mixture of
both. The possible associations or correlations between the variables partly depend
on the database being used and the results can change according to the selected prox- 
imity measure. A proximity measure is a function which measures the similarity or 
dissimilarity between two objects or variables within a set. 

Several topological data analysis studies have been proposed both in the context 
of factorial analyses (discriminant analysis [4], simple and multiple correspondence 
analyses [3], principal component analysis [2]) and in the context of clustering of 
variables [1], clustering of individuals [10] and this proposed TCI approach. 

This paper is organized as follows. In Section 2, we briefly recall the basic 
notion of neighborhood graphs, we define and show how to construct an adjacency 
matrix associated with a proximity measure within the framework of the analysis 
of the correlation structure of a set of quantitative variables, and we present the 
principles of TCI according to continuous data. This is illustrated in Section 3 using 
an example based on real data. The TCI results are compared with those of the well- 
known classical clustering of individuals. Finally, Section 4 presents the concluding 
remarks on this work. 


2 Topological Context 


Topological data analysis is an approach based on the concept of the neighborhood 
graph. The basic idea is actually quite simple: for a given proximity measure for 
continuous or binary data and for a chosen topological structure, we can match a 
topological graph induced on the set of objects. 

In the case of continuous data, we consider $E = \{x^1, \dots, x^j, \dots, x^p\}$, a set of $p$
quantitative variables. We can see in [1] cases of qualitative or even mixed variables. 

We can, by means of a proximity measure $u$, define a neighborhood relationship,
$V_u$, to be a binary relationship based on $E \times E$. There are many possibilities for
building this neighborhood binary relationship. 

Thus, for a given proximity measure u, we can build a neighborhood graph on E, 
where the vertices are the variables and the edges are defined by a property of the 
neighborhood relationship. 

Many definitions are possible to build this binary neighborhood relationship. One 
can choose the Minimal Spanning Tree (MST) [7], the Gabriel Graph (GG) [11] or, 
as is the case here, the Relative Neighborhood Graph (RNG) [14]. 
For any given proximity measure $u$, we can construct the associated adjacency
binary symmetric matrix $V_u$ of order $p$, where all pairs of neighboring variables in
$E$ satisfy the following RNG property:

$$V_u(x^k, x^l) = \begin{cases} 1 & \text{if } u(x^k, x^l) \le \max[u(x^k, x^t), u(x^t, x^l)];\ \forall x^k, x^l, x^t \in E,\ x^t \neq x^k \text{ and } x^t \neq x^l \\ 0 & \text{otherwise.} \end{cases}$$


Fig. 1 Data - RNG structure - Euclidean distance - Associated adjacency matrix.


Figure 1 shows a simple illustrative example in $\mathbb{R}^2$ of a set of quantitative variables
that verify the structure of the RNG graph with Euclidean distance as proximity
measure: $u(x^k, x^l) = \sqrt{\sum_{j=1}^{2} (x_j^k - x_j^l)^2}$.

This generates a topological structure based on the objects in $E$ which are completely
described by the adjacency binary matrix $V_u$.
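
The RNG property can be checked mechanically. The following Python sketch is an illustration only: the function name `rng_adjacency`, the four sample points, the non-strict inequality and the choice to set the diagonal to 1 are assumptions made here, not taken from the paper. It builds the binary adjacency matrix for the Euclidean distance by testing each pair of points against every third point.

```python
import math

def rng_adjacency(points):
    """Binary adjacency matrix of the Relative Neighborhood Graph (Euclidean distance)."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    adj = [[0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k == l:
                adj[k][l] = 1  # assumption: each object is a neighbor of itself
                continue
            # k and l are neighbors if no third point t is closer to both of them
            adj[k][l] = int(all(
                dist[k][l] <= max(dist[k][t], dist[t][l])
                for t in range(n) if t not in (k, l)
            ))
    return adj

# Hypothetical points in the plane, chosen only for illustration
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
adjacency = rng_adjacency(points)
for row in adjacency:
    print(row)
```

The triple loop costs O(n^3) comparisons, which is acceptable for the small sets of variables or individuals considered here.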


2.1 Reference Adjacency Matrices 


Three topological factorial approaches are described in [1] according to the type of 
variables considered: quantitative, qualitative or a mixture of both. We consider here 
the case of a set of quantitative variables. 

We assume that we have at our disposal a set $E = \{x^j;\ j = 1, \dots, p\}$ of $p$
quantitative variables and n individuals-objects. The objective here is to analyze in 
a topological way, the structure of the correlations of the variables considered [2], 
from which the clustering of individuals will then be established. 

We construct the reference adjacency matrix named $V_{u_\star}$, from the correlation
matrix. Expressions of suitable adjacency reference matrices for cases involving 
qualitative variables or mixed variables are given in [1]. 

To examine the correlation structure between the variables, we look at the significance
of their linear correlation. The reference adjacency matrix $V_{u_\star}$ associated
with reference measure $u_\star$ can be written using the Student's t-test of the linear
correlation coefficient $\rho$ of Bravais-Pearson:
Definition 1 For quantitative variables, $V_{u_\star}$ is defined as:

$$V_{u_\star}(x^k, x^l) = \begin{cases} 1 & \text{if } p\text{-value} = P[\,|T_{n-2}| > t\text{-value}\,] \le \alpha;\ \forall k, l = 1, \dots, p \\ 0 & \text{otherwise,} \end{cases}$$

where the p-value is that of the two-sided significance test of the linear correlation
coefficient, with null and alternative hypotheses $H_0 : \rho(x^k, x^l) = 0$ vs.
$H_1 : \rho(x^k, x^l) \neq 0$.

Let $T_{n-2}$ be a Student t-distributed random variable with $\nu = n - 2$ degrees of
freedom. In this case, the null hypothesis is rejected if the p-value is less than or equal
to a chosen significance level $\alpha$, for example, $\alpha = 5\%$. Using a linear correlation
test, a very small p-value means that a correlation at least as strong as the one observed
would be very unlikely if the null hypothesis were true, and consequently we can reject it.
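
As a worked illustration, the test statistic is not written out in the text; the expression below is the usual t statistic for a Pearson correlation $r$ computed on $n$ observations, stated here as an assumption about the test being used. It is applied to the correlation $r = 0.575$ between production and consumption reported in Table 2, with $n = 13$ regions.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}},
\qquad
t = \frac{0.575\,\sqrt{11}}{\sqrt{1-0.575^{2}}} \approx 2.33 .
```

With $\nu = 11$ degrees of freedom, the two-sided 5% critical value is about 2.20, so $|t| \approx 2.33$ leads to rejection of $H_0$, in line with the p-value of 0.040 shown in Table 2.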


2.2 Topological Analysis - Selective Review 


Whatever the type of variable set being considered, the built reference adjacency
matrix $V_{u_\star}$ is associated with an unknown reference proximity measure $u_\star$.

Robustness depends on the $\alpha$ error risk chosen for the null hypothesis (no
linear correlation in the case of quantitative variables, or positive deviation from
independence in the case of qualitative variables); it can be studied by setting a minimum
threshold in order to analyze the sensitivity of the results. Certainly the numerical
results will change, but probably not their interpretation.


We assume that we have at our disposal $\{x^k;\ k = 1, \dots, p\}$, a set of $p$ homogeneous
quantitative variables measured on $n$ individuals. We will use the following notations:

- $X_{(n,p)}$ is the data matrix with $n$ rows-individuals and $p$ columns-variables,

- $V_{u_\star}$ is the symmetric adjacency matrix of order $p$, associated with the reference
measure $u_\star$ which best structures the correlations of the variables,

- $\widehat{X}_{(n,p)} = X V_{u_\star}$ is the projected data matrix with $n$ individuals and $p$ variables,

- $M_p$ is the matrix of distances of order $p$ in the space of individuals,

- $D_n = \frac{1}{n} I_n$ is the diagonal matrix of weights of order $n$ in the space of variables.


We first analyze, in a topological way, the correlation structure of the variables
using a Topological PCA, which consists of carrying out the standardized PCA [6, 8]
triplet $(\widehat{X}, M_p, D_n)$ of the projected data matrix $\widehat{X} = X V_{u_\star}$ and, for comparison,
the duality diagram of the Classical standardized PCA triplet $(X, M_p, D_n)$ of the
initial data matrix $X$. We then proceed with a clustering of individuals based on the
significant principal components of the previous topological PCA.


Definition 2 TCI consists of performing a HAC, based on the Ward criterion¹ [15],
on the significant factors of the standardized PCA of the triplet $(\widehat{X}, M_p, D_n)$.


¹ Aggregation based on the criterion of the loss of minimal inertia.
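
The projection and PCA steps can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `topological_pca`, the random data and the example adjacency matrix are hypothetical, the standardization is applied to the columns of the projected matrix, and individuals carry uniform weights 1/n.

```python
import numpy as np

def topological_pca(X, V):
    """Standardized PCA of the projected matrix X_hat = X V (uniform weights 1/n)."""
    X_hat = np.asarray(X, dtype=float) @ np.asarray(V, dtype=float)
    # Assumption: 'standardized PCA' is read as centering and scaling X_hat's columns
    Z = (X_hat - X_hat.mean(axis=0)) / X_hat.std(axis=0)
    n = Z.shape[0]
    # Singular values of Z / sqrt(n) are the square roots of the PCA eigenvalues
    _, s, Vt = np.linalg.svd(Z / np.sqrt(n), full_matrices=False)
    eigenvalues = s ** 2
    explained = eigenvalues / eigenvalues.sum()   # share of inertia per axis
    scores = Z @ Vt.T                             # principal components of individuals
    return scores, explained

# Hypothetical data: 13 individuals, 4 variables, and an illustrative adjacency matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(13, 4))
V = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
scores, explained = topological_pca(X, V)
print(np.round(explained, 3))
```

A Ward HAC could then be run on the leading columns of `scores`, for instance with `scipy.cluster.hierarchy.linkage(..., method="ward")`; setting `V` to the identity matrix gives the classical standardized PCA for comparison.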
3 Illustrative Example


The data used [13] to illustrate the TCI approach describe the renewable electricity 
(RE) of the 13 French regions in 2017 through 7 quantitative variables relating
to RE. The growth of renewable energy in France is significant. Some French regions 
have expertise in this area; however, the regions' profiles appear to differ. 

The objective is to specify regional disparities in terms of RE by applying topo- 
logical clustering to the French regions in order to identify which were the country's 
greenest regions in 2017. Statistics relating to the variables are displayed in Table 1. 


Table 1 Summary statistics of renewable energy variables.


Variable                       Frequency   Mean   Standard deviation (N)   Coefficient of variation (%)   Min    Max
Total RE production (TWh)      13          6.84   6.58                     96.19                          0.59   2.34
Total RE consumption (TWh)     13          3.70   1.87                     50.67                          2.18   7.06
Coverage RE consumption (%)    13          0.18   0.11                     59.01                          0.02   0.36
Hydroelectricity (%)           13          0.34   0.30                     87.47                          0.01   0.89
Solar electricity (%)          13          0.13   0.09                     72.57                          0.02   0.31
Wind electricity (%)           13          0.39   0.29                     76.12                          0.01   0.86
Biomass electricity (%)        13          0.15   0.19                     130.54                         0.01   0.79


Table 2 Correlation matrix (p-value) - Reference adjacency matrix $V_{u_\star}$.


                   Production  Consumption  Coverage  Hydroelectricity  Solar    Wind     Biomass
Production         1.000
Consumption        0.575       1.000
                   (0.040)
Coverage           0.798       0.000        1.000
                   (0.001)     (0.771)
Hydroelectricity   0.720       0.138        0.872     1.000
                   (0.006)     (0.653)      (0.000)
Solar              -0.272      -0.477       0.105     0.168             1.000
                   (0.369)     (0.099)      (0.734)   (0.582)
Wind               -0.408      -0.305       -0.524    -0.772            -0.395   1.000
                   (0.167)     (0.311)      (0.066)   (0.002)           (0.181)
Biomass            -0.365      0.489        -0.609    -0.459            -0.149   -0.135   1.000
                   (0.220)     (0.090)      (0.027)   (0.114)           (0.627)  (0.660)

$$V_{u_\star} = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & -1 \\ 1 & 0 & 1 & 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 & 0 & 0 & 1 \end{pmatrix}$$


Significance level: p-value $\le \alpha = 5\%$


The adjacency matrix $V_{u_\star}$ associated with the proximity measure $u_\star$, adapted
to the data considered, is built from the correlation matrix in Table 2 according to
Definition 1. Note that in this case, which uses quantitative variables, it is considered 
that two positively correlated variables are related and that two negatively correlated 
variables are related but remote. We will therefore take into account any sign of 
correlation between variables in the adjacency matrix. 
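
The construction of the signed adjacency matrix can be sketched as follows. This is an illustrative Python sketch, not the author's code: the function name `reference_adjacency` is hypothetical, and the coding of significant positive correlations as 1, significant negative correlations as -1 and everything else as 0 is an assumption based on the paragraph above. The input values are those reported in Table 2 for four of the seven variables (coverage, hydroelectricity, wind, biomass).

```python
def reference_adjacency(corr, pvalues, alpha=0.05):
    """Signed adjacency: +1 / -1 for significant positive / negative correlation, else 0."""
    p = len(corr)
    adj = [[0] * p for _ in range(p)]
    for k in range(p):
        for l in range(p):
            if k == l:
                adj[k][l] = 1                     # a variable is linked to itself
            elif pvalues[k][l] <= alpha:          # Definition 1: significant correlation
                adj[k][l] = 1 if corr[k][l] > 0 else -1
    return adj

# Values from Table 2 for: coverage, hydroelectricity, wind, biomass
corr = [[1.000, 0.872, -0.524, -0.609],
        [0.872, 1.000, -0.772, -0.459],
        [-0.524, -0.772, 1.000, -0.135],
        [-0.609, -0.459, -0.135, 1.000]]
pvalues = [[0.000, 0.000, 0.066, 0.027],
           [0.000, 0.000, 0.002, 0.114],
           [0.066, 0.002, 0.000, 0.660],
           [0.027, 0.114, 0.660, 0.000]]
signed_adjacency = reference_adjacency(corr, pvalues)
for row in signed_adjacency:
    print(row)
```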

We first carry out a Topological PCA to identify the correlation structure of the 
variables. A HAC, according to Ward's criterion, is then applied to the significant 
principal components of the PCA of the projected data. We then compare the results 
of a topological and a classical PCA. 

Figure 2 presents, for comparison on the first factorial plane, the correlations 
between principal components-factors and the original variables. 




We can see that these correlations are slightly different, as are the percentages of 
the inertias explained on the first principal planes of Topological and Classic PCA. 


Fig. 2 Topological & Classical PCA of RE of the French regions. Panels: Topological PCA (Axis 1: explained inertia 57.89%) and Classical PCA (Axis 1: explained inertia 47.89%).


The first two factors of the Topological PCA explain 57.89% and 26.11%, respectively,
accounting for 83.99% of the total variation in the data set, whereas the
first two factors of the Classical PCA add up to 75.20%. Thus, the first two factors
provide an adequate synthesis of the data, that is, of RE in the French regions. We
restrict the comparison to the first significant factorial axes.

For comparison, Figure 3 shows dendrograms of the Topological and Classical 
clustering of the French regions according to their RE. Note that the partitions chosen 
in 5 clusters are appreciably different, as much by composition as by characterization. 
The percentage of variance explained by the TCI approach, $R^2 = 86.42\%$, is higher than
that of the classic approach, $R^2 = 84.15\%$, indicating that the clusters produced via
the TCI approach are more homogeneous than those generated by the Classical one.
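
The text does not define how $R^2$ is computed. A common reading, assumed here, is the ratio of between-cluster inertia to total inertia for a given partition. The Python sketch below illustrates that ratio on hypothetical points; the function name `r_squared` and the data are not from the paper.

```python
def r_squared(points, labels):
    """Between-cluster inertia divided by total inertia (uniform weights)."""
    n, dim = len(points), len(points[0])
    center = [sum(p[j] for p in points) / n for j in range(dim)]
    total = sum((p[j] - center[j]) ** 2 for p in points for j in range(dim))
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    between = 0.0
    for members in clusters.values():
        m = len(members)
        centroid = [sum(p[j] for p in members) / m for j in range(dim)]
        # Each cluster contributes its size times the squared distance to the global center
        between += m * sum((centroid[j] - center[j]) ** 2 for j in range(dim))
    return between / total

# Hypothetical example: two tight, well-separated clusters
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
r2 = r_squared(points, labels)
print(round(100 * r2, 2), "%")
```

Values close to 100% indicate that most of the dispersion lies between clusters rather than within them, which is the sense in which a higher $R^2$ signals more homogeneous clusters.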

Based on the TCI analysis, the Corse region alone constitutes the fourth cluster,
and the Nouvelle-Aquitaine region is found in the second cluster with the Grand-Est,
Occitanie and Provence-Alpes-Côte-d'Azur (PACA) regions; however, in the
Classical clustering, these two regions - Corse and Nouvelle-Aquitaine - together
constitute the third cluster.

Figure 4 summarizes the significant profiles (+) and anti-profiles (-) of the two 
typologies; with a risk of error less than or equal to 5%, they are quite different. 

The first cluster produced via the TCI approach, consisting of a single region,
Auvergne-Rhône-Alpes (AURA), is characterized by a high share of hydroelectricity,
a high level of coverage of regional consumption, and high RE production and
consumption. The second cluster - which groups together the four regions of Grand-Est,
Occitanie, Provence-Alpes-Côte-d'Azur (PACA) and Nouvelle-Aquitaine - is considered
a homogeneous cluster, which means that none of the seven RE characteristics
differ significantly from the average of these characteristics across all regions. This
cluster can therefore be considered to reflect the typical picture of RE in France.
Fig. 3 Topological and Classical dendrograms of the French regions.

Fig. 4 Typologies - Characterization of TCI & Classical clusters




Cluster 3, which consists of six regions, is characterized by a high degree of wind 
energy, a low degree of hydroelectricity, low coverage of regional consumption, and 
low production and consumption of RE compared to the national average. Cluster 
4, represented by the Corse region, is characterized by a high share of solar energy 
and low production and consumption of RE. The last cluster, represented by the Île-de-France
region, is characterized by a high share of biomass energy. Regarding the
other types of RE, their share is close to the national average. 
4 Conclusion


This paper proposes a new topological approach to the clustering of individuals which 
can enrich classical data analysis methods within the framework of the clustering of 
objects. The results of the topological clustering approach, based on the notion of a
neighborhood graph, are as good as, or even better than, those of the existing classical
method according to the R-squared results. The TCI approach is easily programmable
from the PCA and HAC procedures of SAS, SPAD or R software. Future work will
involve extending this topological approach to other methods of data analysis, in 
particular in the context of evolutionary data analysis. 
Model Based Clustering of Functional Data with
Mild Outliers 


Cristina Anton and Iain Smith 


Abstract We propose a procedure, called CFunHDDC, for clustering functional data 
with mild outliers which combines two existing clustering methods: the functional 
high dimensional data clustering (FunHDDC) [1] and the contaminated normal mix- 
ture (CNmixt) [3] method for multivariate data. We adapt the FunHDDC approach 
to data with mild outliers by considering a mixture of multivariate contaminated nor- 
mal distributions. To fit the functional data in group-specific functional subspaces 
we extend the parsimonious models considered in FunHDDC, and we estimate the 
model parameters using an expectation-conditional maximization algorithm (ECM). 
The performance of the proposed method is illustrated for simulated and real-world 
functional data, and CFunHDDC outperforms FunHDDC when applied to functional 
data with outliers. 


Keywords: functional data, model-based clustering, contaminated normal distribu- 
tion, EM algorithm 


1 Introduction 


Recently, model-based clustering for functional data has received a lot of attention. 
Real data are often contaminated by outliers that affect the estimations of the model 
parameters. Here we propose a method for clustering functional data with mild 
outliers. Mild outliers are usually sampled from a population different from the 
assumed model, so we need to choose a model flexible enough to accommodate
them. 

Functional data live in an infinite dimensional space and model-based methods 
for clustering are not directly available because the notion of probability density 
function generally does not exist for such data. A first approach is to use a two- 
step method and first do a discretization or a decomposition of the functional data 
in a basis of functions (such as Fourier series, B-splines, etc.), and then directly 
apply multivariate clustering methods to the discretization or the basis coefficients. 
A second approach, which allows the interaction between the discretization and the 
clustering steps, is based on a probabilistic model for the basis coefficients [1, 2]. 

We follow the second approach, and we propose a method, called CFunHDDC, 
which extends the functional high dimensional data clustering (FunHDDC) [1] to 
clustering functional data with mild outliers. There are several methods to detect 
outliers of functional data and a robust clustering methodology based on trimming 
is presented in [4]. Our approach does not involve trimming the outliers and it is 
inspired by the method CNmixt [3] for clustering multivariate data with mild outliers. 
We propose a model for the basis coefficients based on a mixture of contaminated 
multivariate normal distributions. A multivariate contaminated normal distribution 
is a two-component normal mixture in which the bad observations (outliers) are 
represented by a component with a small prior probability and an inflated covariance 
matrix. 

In the next section we present the model and its parsimonious variants. Parameter 
estimation is included in Section 3. In Section 4 we present applications to simulated 
and real-world data. The last section includes the conclusions. 


2 The Model 

We suppose that we observe $n$ curves $\{x_1, \dots, x_n\}$ and we want to cluster them
in $K$ homogeneous groups. For each curve $x_i$ we have access to a finite set of
values $x_{ij} = x_i(t_{ij})$, where $0 \le t_{i1} < t_{i2} < \dots < t_{im_i} \le T$. We assume that the
observed curves are independent realizations of an $L^2$-continuous stochastic process
$X = \{X(t)\}_{t \in [0,T]}$ for which the sample paths are in $L^2[0, T]$. To reconstruct the
functional form of the data we assume that the curves belong to a finite dimensional
space spanned by a basis of functions $\{\xi_1, \dots, \xi_p\}$, so we have the expansion for
each curve

$$x_i(t) = \sum_{j=1}^{p} \gamma_{ij} \xi_j(t).$$

Here we assume that the dimension $p$ is fixed and known. We consider a model based
on a mixture of multivariate contaminated normal distributions for the coefficient
vectors $\{\gamma_1, \dots, \gamma_n\} \subset \mathbb{R}^p$, $\gamma_i = (\gamma_{i1}, \dots, \gamma_{ip})^\top \in \mathbb{R}^p$, $i = 1, \dots, n$.
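
The basis expansion can be illustrated with a short Python sketch. The text does not say how the coefficients are estimated; ordinary least squares on the observed values is a common choice and is assumed here. The function names, the ordering of the Fourier basis (constant, then sine and cosine pairs) and the synthetic curve are all hypothetical.

```python
import numpy as np

def fourier_basis(t, p, T):
    """Evaluate the first p Fourier basis functions on [0, T] at the points t."""
    t = np.asarray(t, dtype=float)
    columns = [np.ones_like(t)]
    m = 1
    while len(columns) < p:
        columns.append(np.sin(2 * np.pi * m * t / T))
        if len(columns) < p:
            columns.append(np.cos(2 * np.pi * m * t / T))
        m += 1
    return np.column_stack(columns)

def basis_coefficients(t, x, p, T):
    """Least-squares estimate of the coefficient vector of one observed curve."""
    B = fourier_basis(t, p, T)
    gamma, *_ = np.linalg.lstsq(B, np.asarray(x, dtype=float), rcond=None)
    return gamma

# Synthetic curve built from known coefficients, then recovered from its sampled values
T = 1.0
t = np.linspace(0.0, T, 50, endpoint=False)
true_gamma = np.array([1.0, 0.5, -2.0, 0.0, 3.0])
x = fourier_basis(t, 5, T) @ true_gamma
gamma = basis_coefficients(t, x, 5, T)
print(np.round(gamma, 6))
```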

We suppose that there exist two unobserved random variables $Z = (Z_1, \dots, Z_K)$,
$Y = (Y_1, \dots, Y_K) \in \{0,1\}^K$, where $Z$ indicates the cluster membership and $Y$
whether an observation is good or bad (outlier). $Z_k = 1$ if $X$ belongs to the $k$th cluster
and $Z_k = 0$ otherwise, and $Y_k = 1$ if $X$ belongs to the $k$th cluster and it is a good
observation, and $Y_k = 0$ otherwise. For clustering we need to predict the value
$z_i = (z_{i1}, \dots, z_{iK})$ of $Z$, and to determine the bad observations we need to predict
the value $v_i = (v_{i1}, \dots, v_{iK})$ of $Y$ for each observed curve $x_i$, $i = 1, \dots, n$.

We consider a set of $n_k$ observed curves of the $k$th cluster with the coefficients
$\{\gamma_1, \dots, \gamma_{n_k}\} \subset \mathbb{R}^p$. We assume that $\{\gamma_1, \dots, \gamma_{n_k}\}$ are independent realizations
of a random vector $\Gamma \in \mathbb{R}^p$, and that the stochastic process associated with the
$k$th cluster can be described in a lower dimensional subspace $E_k[0, T] \subset L^2[0, T]$
with dimension $d_k < p$ and spanned by the first $d_k$ elements of a group specific
basis of functions $\{\varphi_{kj}\}_{j=1,\dots,p}$ that can be obtained from $\{\xi_j\}_{j=1,\dots,p}$ by a linear
transformation

$$\varphi_{kj} = \sum_{l=1}^{p} q_{k,jl}\, \xi_l,$$

with a $p \times p$ orthogonal matrix $Q_k = (q_{k,jl})$. In [1] for FunHDDC the assumption
is that the distribution of $\Gamma$ for the $k$th cluster is $\Gamma \sim N(\mu_k, \Sigma_k)$, $\Sigma_k = Q_k \Delta_k Q_k^\top$,
where

$$\Delta_k = \mathrm{diag}(a_{k1}, \dots, a_{kd_k}, \underbrace{b_k, \dots, b_k}_{p - d_k}),$$

with $a_{kj} > b_k$, $j = 1, \dots, d_k$. We can say that the variance of the actual data in the
$k$th cluster is modeled by $a_{k1}, \dots, a_{kd_k}$, and the parameter $b_k$ models the variance
of the noise [1].
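We follow the approach in [3] and we assume that $\Gamma$ for the $k$th cluster has the
multivariate contaminated normal distribution with density

$$f(\gamma_i; \theta_k) = \alpha_k \phi(\gamma_i; \mu_k, \Sigma_k) + (1 - \alpha_k) \phi(\gamma_i; \mu_k, \eta_k \Sigma_k), \tag{1}$$

where $\alpha_k \in (0.5, 1)$, $\eta_k > 1$, $\theta_k = \{\alpha_k, \mu_k, \Sigma_k, \eta_k\}$, and $\phi(\gamma_i; \mu_k, \Sigma_k)$ is the density
for the $p$-variate normal distribution $N(\mu_k, \Sigma_k)$:

$$\phi(\gamma_i; \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left\{ -\frac{1}{2} (\gamma_i - \mu_k)^\top \Sigma_k^{-1} (\gamma_i - \mu_k) \right\}. \tag{2}$$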


Here $\alpha_k$ defines the proportion of uncontaminated data in the $k$th cluster and $\eta_k$
represents the degree of contamination. We can see $\eta_k$ as an inflation parameter that
measures the increase in variability due to the bad observations.
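
The contaminated normal density of equation (1) can be evaluated directly. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical and the parameter values in the example are chosen only so that the result can be checked by hand.

```python
import numpy as np

def normal_logpdf(y, mu, sigma):
    """Log-density of the p-variate normal N(mu, sigma) at the point y."""
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    diff = y - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)      # squared Mahalanobis distance
    return -0.5 * (mu.size * np.log(2 * np.pi) + logdet + maha)

def contaminated_normal_pdf(y, mu, sigma, alpha, eta):
    """Two-component mixture: good points with weight alpha, outliers with inflated covariance."""
    sigma = np.asarray(sigma, dtype=float)
    good = alpha * np.exp(normal_logpdf(y, mu, sigma))
    bad = (1.0 - alpha) * np.exp(normal_logpdf(y, mu, eta * sigma))
    return good + bad

# Illustrative values: p = 2, identity covariance, 10% contamination, inflation 10
density = contaminated_normal_pdf(np.zeros(2), np.zeros(2), np.eye(2), alpha=0.9, eta=10.0)
print(density)
```

At the mean, with $p = 2$ and identity covariance, the good component contributes $0.9 / (2\pi)$ and the inflated one $0.1 / (2\pi \cdot 10)$, so the density equals $0.91 / (2\pi)$.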

Each curve $x_i$ has a basis expansion with coefficient $\gamma_i$ such that $\gamma_i$ is a random
vector whose distribution is a mixture of contaminated Gaussians with density

$$p(\gamma; \theta) = \sum_{k=1}^{K} \pi_k f(\gamma; \theta_k), \tag{3}$$

where $\pi_k = P(Z_k = 1)$ is the prior probability of the $k$th cluster and
$\theta = \bigcup_{k=1}^{K} (\theta_k \cup \{\pi_k\})$ is the set formed by all the parameters. We refer to this model as
FCLM[$a_{kj}, b_k, Q_k, d_k$] (functional contaminated latent mixture). As in [1] we consider
the parsimonious sub-models: FCLM[$a_{kj}, b, Q_k, d_k$], FCLM[$a_k, b_k, Q_k, d_k$],
FCLM[$a, b_k, Q_k, d_k$], FCLM[$a_k, b, Q_k, d_k$], FCLM[$a, b, Q_k, d_k$].


3 Model Inference 


To fit the models we use the ECM algorithm [3], which is a variant of the EM
algorithm. In the ECM algorithm we replace the M-step in the EM algorithm by two
simpler CM-steps given by the partition of the set of parameters $\theta = \{\Psi_1, \Psi_2\}$,
where $\Psi_1 = \{\pi_k, \alpha_k, \mu_k, a_{kj}, b_k, q_{kj},\ k = 1, \dots, K,\ j = 1, \dots, d_k\}$,
$\Psi_2 = \{\eta_k,\ k = 1, \dots, K\}$, and $q_{kj}$ is the $j$th column of $Q_k$.

We have two sources of missing data: the clusters' labels and the type of observation
(good or bad). Thus the complete data are given by $S = \{\gamma_i, z_i, v_i\}_{i=1,\dots,n}$, and
the complete-data likelihood is


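$$L_c(\theta; S) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left\{ \pi_k \left[ \alpha_k \phi(\gamma_i; \mu_k, \Sigma_k) \right]^{v_{ik}} \left[ (1 - \alpha_k) \phi(\gamma_i; \mu_k, \eta_k \Sigma_k) \right]^{1 - v_{ik}} \right\}^{z_{ik}}.$$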


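We denote the complete-data log-likelihood by $l_c(\theta; S) = \log(L_c(\theta; S))$.
Next we present the ECM algorithm for the model FCLM[$a_{kj}, b_k, Q_k, d_k$]. At
the $q$th iteration of the ECM algorithm, in the E-step we calculate
$E[l_c(\theta^{(q-1)}; S) \mid \gamma_1, \dots, \gamma_n, \theta^{(q-1)}]$, given the current values of the parameters $\theta^{(q-1)}$. This reduces to
the calculation of $z_{ik}^{(q)} = E[Z_{ik} \mid \gamma_i, \theta^{(q-1)}]$, $v_{ik}^{(q)} = E[Y_{ik} \mid \gamma_i, z_i, \theta^{(q-1)}]$.
In the first CM step of the $q$th iteration of the ECM algorithm we calculate $\Psi_1^{(q)}$ as
the value of $\Psi_1$ that maximizes the expected complete-data log-likelihood with $\Psi_2$ fixed at $\Psi_2^{(q-1)}$. We obtain

$$\pi_k^{(q)} = \frac{\sum_{i=1}^{n} z_{ik}^{(q)}}{n}, \quad \alpha_k^{(q)} = \frac{\sum_{i=1}^{n} z_{ik}^{(q)} v_{ik}^{(q)}}{\sum_{i=1}^{n} z_{ik}^{(q)}}, \quad \mu_k^{(q)} = \frac{\sum_{i=1}^{n} z_{ik}^{(q)} \left( v_{ik}^{(q)} + \frac{1 - v_{ik}^{(q)}}{\eta_k^{(q-1)}} \right) \gamma_i}{\sum_{i=1}^{n} z_{ik}^{(q)} \left( v_{ik}^{(q)} + \frac{1 - v_{ik}^{(q)}}{\eta_k^{(q-1)}} \right)}, \tag{4}$$

$$\Sigma_k^{(q)} = \frac{1}{\sum_{i=1}^{n} z_{ik}^{(q)}} \sum_{i=1}^{n} z_{ik}^{(q)} \left( v_{ik}^{(q)} + \frac{1 - v_{ik}^{(q)}}{\eta_k^{(q-1)}} \right) (\gamma_i - \mu_k^{(q)}) (\gamma_i - \mu_k^{(q)})^\top. \tag{5}$$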


We introduce a value $\alpha^*$ and we constrain $\alpha_k \in (\alpha^*, 1)$. If the estimate $\alpha_k^{(q)}$ in
(4) is less than $\alpha^*$, we use the optimize() function in the stats package in R to do a
numerical search for $\alpha_k^{(q)}$.
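
One E-step followed by the first CM-step can be sketched in Python with NumPy. This is a simplified illustration, not the CFunHDDC implementation: the posterior probabilities are computed with Bayes' rule from the mixture density, which is the usual form for contaminated normal mixtures and is assumed here; the updates follow the weighted-mean and weighted-covariance form of (4) and (5); the constraint on $\alpha_k$, the subspace parameters $a_{kj}$, $b_k$, $q_{kj}$ and the update of $\eta_k$ are left out. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def normal_pdf(Y, mu, sigma):
    """p-variate normal density N(mu, sigma) evaluated at each row of Y."""
    diff = Y - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    return np.exp(-0.5 * (mu.size * np.log(2 * np.pi) + logdet + maha))

def e_step(Y, pi, alpha, mu, sigma, eta):
    """Posterior cluster probabilities z and 'good observation' probabilities v."""
    K = len(pi)
    good = np.column_stack([alpha[k] * normal_pdf(Y, mu[k], sigma[k]) for k in range(K)])
    bad = np.column_stack([(1 - alpha[k]) * normal_pdf(Y, mu[k], eta[k] * sigma[k])
                           for k in range(K)])
    f = good + bad                                  # contaminated density per cluster
    z = pi * f
    z = z / z.sum(axis=1, keepdims=True)
    v = good / f
    return z, v

def cm_step_1(Y, z, v, eta):
    """Updates of pi, alpha, mu and Sigma with the inflation parameters eta held fixed."""
    n, K = z.shape
    pi = z.sum(axis=0) / n
    alpha = (z * v).sum(axis=0) / z.sum(axis=0)
    w = z * (v + (1 - v) / eta)                     # down-weights likely outliers
    mu = (w.T @ Y) / w.sum(axis=0)[:, None]
    sigma = []
    for k in range(K):
        diff = Y - mu[k]
        sigma.append((w[:, k, None] * diff).T @ diff / z[:, k].sum())
    return pi, alpha, mu, np.array(sigma)

# Synthetic coefficients: two well-separated groups in R^2 (illustration only)
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(5, 1, size=(30, 2))])
pi0, alpha0 = np.array([0.5, 0.5]), np.array([0.9, 0.9])
mu0 = np.array([[0.0, 0.0], [5.0, 5.0]])
sigma0 = np.array([np.eye(2), np.eye(2)])
eta0 = np.array([10.0, 10.0])
z, v = e_step(Y, pi0, alpha0, mu0, sigma0, eta0)
pi1, alpha1, mu1, sigma1 = cm_step_1(Y, z, v, eta0)
print(np.round(pi1, 3), np.round(alpha1, 3))
```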
As in [1] we get the updated values $a_{kj}^{(q)}$, $b_k^{(q)}$, $q_{kj}^{(q)}$, $k = 1, \dots, K$, $j = 1, \dots, d_k$,
from the sample covariance matrix $\Sigma_k^{(q)}$ of cluster $k$, using also the matrix of
inner products between the basis functions $W = (w_{jl})_{1 \le j, l \le p}$, where
$w_{jl} = \int_0^T \xi_j(t) \xi_l(t)\, dt$.

In the second CM step of the $q$th iteration of the ECM algorithm we calculate $\Psi_2^{(q)}$
as the value that maximizes the expected complete-data log-likelihood with $\Psi_1$ fixed at $\Psi_1^{(q)}$.
At the end of the ECM algorithm, we do a two-step classification to provide the
expected clustering. If $q_f$ is the last iteration of the algorithm before convergence,
an observation $y_i \in \mathbb{R}^p$ is assigned to the cluster $k_0 \in \{1, \dots, K\}$ with the
largest $z_{ik}^{(q_f)}$. Next, an observation $y_i$ that was assigned to the cluster $k_0$ is
considered good if $v_{ik_0}^{(q_f)} > 0.5$, and it is considered bad otherwise. After the
classification step we can eliminate the bad observations and run FunHDDC to re-cluster
the remaining observations.

The class-specific dimension $d_k$ is selected through the scree test of Cattell, by
comparing the differences between consecutive eigenvalues with a given threshold [1]. The
number of clusters $K$ as well as the parsimonious model are selected using the BIC
criterion.
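The two-step classification rule can be written down directly. The following is a minimal sketch, assuming the final posterior quantities are available as nested lists; the function name `classify` and the argument layout are hypothetical.

```python
def classify(z, v):
    """Two-step classification (sketch).

    z[i][k]: posterior probability that observation i belongs to cluster k
    v[i][k]: posterior probability that observation i is 'good' within cluster k
    Returns the cluster label of each observation and a good/bad flag.
    """
    labels, is_good = [], []
    for z_i, v_i in zip(z, v):
        # Step 1: assign to the cluster with the largest membership probability.
        k0 = max(range(len(z_i)), key=lambda k: z_i[k])
        # Step 2: flag as good only if the 'good' probability in that cluster exceeds 0.5.
        labels.append(k0)
        is_good.append(v_i[k0] > 0.5)
    return labels, is_good
```

Observations flagged as bad can then be dropped before re-clustering the remaining curves.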


4 Applications 
Fig. 1 Smooth data simulated without outliers (a), according to scenario A (b), scenario B (c),
and scenario C (d), coloured by group for one simulation.


We simulate 1000 curves based on the model FCLM$[a_k, b_k, Q_k, d_k]$. The number
of clusters is fixed to $K = 3$ and the mixing proportions are equal, $\pi_1 = \pi_2 = \pi_3 = 1/3$.
We consider the following values of the parameters:


Group 1: $d = 5$, $a = 150$, $b = 5$, $\mu = (1, 0, 50, 100, 0, \dots, 0)$
Group 2: $d = 20$, $a = 15$, $b = 8$, $\mu = (0, 0, 80, 0, 40, 2, 0, \dots, 0)$
Group 3: $d = 10$, $a = 30$, $b = 10$, $\mu = (0, \dots, 0, 20, 0, 80, 0, 0, 100)$,


where $d$ is the intrinsic dimension of the subgroups, $\mu$ is the mean vector of size 70,
$a$ is the value of the first $d$ diagonal elements of A, and $b$ the value of the last $70 - d$
ones. Curves are smoothed using 35 Fourier basis functions. We repeat the simulation
100 times. A sample of these data is plotted in Figure 1 a. We consider the following
contamination schemes, where the scores are simulated from contaminated normal
distributions with the previous parameters and


A: $\alpha_i = 0.9$, $i = 1, \dots, 3$, and $\eta_1 = 7$, $\eta_2 = 10$, $\eta_3 = 17$.
B: $\alpha_i = 0.9$, $i = 1, \dots, 3$, and $\eta_1 = 5$, $\eta_2 = 50$, $\eta_3 = 15$.
C: $\alpha_i = 0.9$, $i = 1, \dots, 3$, and $\eta_1 = 100$, $\eta_2 = 70$, $\eta_3 = 170$.


Samples of data generated according to scenarios A, B, C are plotted in Figure 1
b, c, d, respectively. We notice that there is more overlap between the 3 groups
when we increase the values of $\eta$.
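To make the contamination scheme concrete, the sketch below draws score vectors from a contaminated normal distribution with a diagonal covariance: with probability $\alpha$ a point comes from the "good" component, otherwise from a component whose variances are inflated by $\eta$. The function name and the use of a diagonal covariance are assumptions made for illustration; the values used in the usage example are those of Group 1 under scenario A.

```python
import random


def sample_contaminated_normal(mu, var, alpha, eta, rng):
    """Draw one vector from alpha * N(mu, diag(var)) + (1 - alpha) * N(mu, eta * diag(var)).

    Returns the sampled vector and a flag telling whether it is a 'good' point.
    """
    is_good = rng.random() < alpha
    scale = 1.0 if is_good else eta  # bad points have variances inflated by eta
    x = [rng.gauss(m, (scale * s2) ** 0.5) for m, s2 in zip(mu, var)]
    return x, is_good


# Usage: Group 1 (d = 5, a = 150, b = 5) under scenario A (alpha = 0.9, eta = 7).
rng = random.Random(0)
mu = [1.0, 0.0, 50.0, 100.0] + [0.0] * 66
var = [150.0] * 5 + [5.0] * 65
scores, good = sample_contaminated_normal(mu, var, alpha=0.9, eta=7.0, rng=rng)
```

Smoothed curves would then be obtained by using such score vectors as basis coefficients, a step that is not shown here.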


Table 1 Mean (and standard deviation) of the ARI for the best model according to BIC over 100
simulations. Bold values indicate the highest value for each method.


Scenario  Method    $\alpha^*$  $\varepsilon$  ARI             ARI Outliers
A         FunHDDC   -           0.05           0.519 (0.11)    -
A         FunHDDC   -           0.1            0.499 (0.05)    -
A         FunHDDC   -           0.2            0.494 (0.01)    -
A         CFunHDDC  0.75        0.05           0.769 (0.23)    0.959 (0.04)
A         CFunHDDC  0.75        0.1            0.986 (0.08)    0.998 (0.01)
A         CFunHDDC  0.75        0.2            0.9995 (0.001)  1 (0)
B         FunHDDC   -           0.05           0.861 (0.23)    -
B         FunHDDC   -           0.1            0.754 (0.25)    -
B         FunHDDC   -           0.2            0.52 (0.09)     -
B         CFunHDDC  0.75        0.05           0.807 (0.22)    0.961 (0.05)
B         CFunHDDC  0.75        0.1            0.948 (0.14)    0.99 (0.03)
B         CFunHDDC  0.75        0.2            0.990 (0.062)   0.971 (0.149)
C         FunHDDC   -           0.05           0.490 (0.02)    -
C         FunHDDC   -           0.1            0.491 (0.02)    -
C         FunHDDC   -           0.2            0.494 (0.01)    -
C         CFunHDDC  0.75        0.05           0.736 (0.23)    0.928 (0.10)
C         CFunHDDC  0.75        0.1            0.911 (0.18)    0.958 (0.15)
C         CFunHDDC  0.75        0.2            0.965 (0.11)    0.994 (0.03)


Table 2 Correct classification rates for each method.


Method    $\varepsilon$  CCR     Method    $\alpha^*$  $\varepsilon$  CCR     Method  $\alpha^*$  CCR
FunHDDC   0.01           0.68    CFunHDDC  0.85        0.01           0.67    CNmixt  0.5         0.67
FunHDDC   0.05           0.64    CFunHDDC  0.85        0.05           0.70    CNmixt  0.75        0.66
FunHDDC   0.1            0.59    CFunHDDC  0.885       0.1            0.70    CNmixt  0.85        0.67
FunHDDC   0.2            0.57    CFunHDDC  0.85        0.2            0.6     CNmixt  0.9         0.66


The quality of the estimated partitions obtained using FunHDDC and CFunHDDC
is evaluated using the Adjusted Rand Index (ARI) [3], and the results are included in
Table 1. For FunHDDC we use the library funHDDC in R. We run both algorithms
for $K = 3$ with all 6 sub-models, and the best solution in terms of the highest BIC
value among those sub-models is returned. The initialization is done with the k-means
strategy with 50 repetitions, and the maximum number of iterations is 200 for the
stopping criterion. We use $\varepsilon \in \{0.05, 0.1, 0.2\}$ in the Cattell test.
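For readers who want to reproduce the evaluation metric, the Adjusted Rand Index can be computed from the contingency table of two partitions. The block below is a small self-contained sketch of the usual pair-counting formula; the function name is hypothetical, and this is not the code used to produce Table 1.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same observations."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency table cells
    rows = Counter(labels_true)                     # row sums
    cols = Counter(labels_pred)                     # column sums
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case (e.g. a single cluster in both partitions).
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a relabelling of the clusters, and is close to 0 for partitions that agree only by chance.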

We notice that CFunHDDC outperforms FunHDDC, and it gives excellent results
even in Scenario C. For CFunHDDC the best results are obtained for $\varepsilon = 0.2$ in the
Cattell test, and the values of the ARI are close to 1.
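The role of the threshold $\varepsilon$ in the Cattell scree test can be illustrated as follows. The exact rule implemented in the software used here is not spelled out in the text, so the sketch below is an assumption: it keeps the last position at which the drop between consecutive eigenvalues is at least $\varepsilon$ times the largest drop. The function name is hypothetical.

```python
def cattell_dimension(eigenvalues, threshold):
    """Select an intrinsic dimension with a scree-test rule (assumed variant).

    The dimension is the last index j such that the drop between the j-th and
    (j+1)-th largest eigenvalues is at least threshold * (largest drop).
    """
    ev = sorted(eigenvalues, reverse=True)
    diffs = [ev[j] - ev[j + 1] for j in range(len(ev) - 1)]
    max_diff = max(diffs)
    d = 1
    for j, diff in enumerate(diffs, start=1):
        if diff >= threshold * max_diff:
            d = j
    return d
```

Under this rule a smaller threshold can only keep the selected dimension equal or make it larger, since more of the small drops in the tail of the spectrum count as significant.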

Next, we consider the NOx data available in the fda.usc library in R, representing
daily curves of nitrogen oxides (NOx) emissions in the neighborhood of the industrial
area of Poblenou, Barcelona (Spain). The measurements of NOx (in μg/m³) were taken
hourly, resulting in 76 curves for "working days" and 39 curves for "non-working days"
(see Figure 2 a). Since NOx is a contaminant agent, the detection of outlying emissions
is useful for environmental protection. This data set has been used for testing methods
for the detection of outliers and to illustrate robust clustering based on trimming for
functional data [4].


We apply CFunHDDC, FunHDDC, and CNmixt to the NOx data. Curves are
smoothed using a basis of 8 Fourier functions, and we run the algorithms for $K = 2$
clusters. For CFunHDDC and FunHDDC we use $\varepsilon \in \{0.001, 0.05, 0.1, 0.2\}$ in the
Cattell test, and the rest of the settings are the same as in the simulation study. We
run CNmixt for all 14 models from the ContaminatedMixt R library, based on the
coefficients in the Fourier basis, with 1000 iterations for the stopping criterion, and
initialization done with the k-means method. The correct classification rates (CCR)
are reported in Table 2.
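Since cluster labels are arbitrary, a correct classification rate has to be computed after matching clusters to the known classes. The sketch below takes the best one-to-one matching by brute force, which is cheap for $K = 2$; the function name is hypothetical, and it assumes there are no more clusters than classes.

```python
from itertools import permutations


def correct_classification_rate(labels_true, labels_pred):
    """CCR under the best one-to-one matching of cluster labels to class labels.

    Assumes the number of distinct clusters does not exceed the number of classes.
    """
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best = max(best, hits)
    return best / len(labels_true)
```

For larger numbers of clusters the brute-force search would normally be replaced by an assignment algorithm, which is outside the scope of this sketch.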

The CCR for CFunHDDC are slightly better than the ones for FunHDDC and
CNmixt, and are comparable with the ones reported in Table 1 in [4] for Funclust,
RFC, and TrimK. In Figure 2 b, c we present the clusters and the detected outliers
for $\varepsilon = 0.05$ and $\alpha^* = 0.85$. The curves that are detected as outliers (green lines)
exhibit different patterns from the rest of the curves.


Fig. 2 a. Daily NOx curves for 115 days; b, c. Clustering obtained with CFunHDDC,
$\varepsilon = 0.05$, $\alpha^* = 0.85$; non-working days (blue), working days (red), outliers (green).

One of the advantages of extending FunHDDC to CFunHDDC is outlier
detection. For $\alpha^* = 0.85$ and $\varepsilon = 0.05$, CFunHDDC detects 16 outliers, which are the
same as the outliers mentioned in [4]. For the data without outliers, CFunHDDC
becomes equivalent to FunHDDC, and for the trimmed data the CCR increases to
0.79.


5 Conclusion 


We propose a new method, CFunHDDC, that extends the FunHDDC functional clus- 
tering method to data with mild outliers. Unlike other robust functional clustering 
algorithms, CFunHDDC does not involve trimming the data. CFunHDDC is based 
on a model formed by a mixture of contaminated multivariate normal distributions, 
which makes parameter estimation more difficult than for FunHDDC, so we use an 
ECM instead of an EM algorithm. The clustering and outlier detection performance 
of CFunHDDC is tested for simulated data and the NOx data and it always out- 
performs FunHDDC. Moreover, CFunHDDC has a comparable performance with 
robust functional clustering methods based on trimming, such as RFC and TrimK, 
and it has similar or better performance when compared to a two-step method based 
on CNmixt. Although there are several model-based methods for multivariate data
with outliers that can be used to construct two-step methods for functional data, as
observed in [1], these two-step methods always suffer from the difficulty of choosing
the best discretization. CFunHDDC can be extended to multivariate functional data,
and recently, independently of our work, a similar approach was followed in [5], but
without considering the parsimonious models and the value $\alpha^*$.
A Trivariate Geometric Classification of
Decision Boundaries for Mixtures of Regressions


Filippo Antonazzo and Salvatore Ingrassia


Filippo Antonazzo (✉)
Inria, Université de Lille, CNRS, Laboratoire de mathématiques Painlevé, 59650 Villeneuve d'Ascq,
France, e-mail: filippo.antonazzo@inria.fr

Salvatore Ingrassia
Dipartimento di Economia e Impresa, Università di Catania, Corso Italia 55, 95129 Catania, Italy,
e-mail: salvatore.ingrassia@unict.it

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_3


Abstract Mixtures of regressions play a prominent role in regression analysis when
it is known that the population of interest is divided into homogeneous and disjoint
groups. This typically consists in partitioning the observational space into several
regions through particular hypersurfaces called decision boundaries. A geometrical
analysis of these surfaces makes it possible to highlight properties of the classifier used.
In particular, a geometrical classification of decision boundaries for the three most
used mixtures of regressions (with fixed covariates, with concomitant variables and
with random covariates) was provided in the case of one and two covariates, under
Gaussian assumptions and in the presence of only one real response variable. This work
aims to extend these results to a more complex setting where three independent
variables are considered.


Keywords: mixtures of regressions, decision boundaries, hyperquadrics, model-based
clustering


1 Introduction


Linear regression is commonly employed to model the relationship between a
$d$-dimensional real vector of covariates $X$ and a real response variable $Y$. It is well suited
if we can assume that the regression coefficients are fixed over all possible realizations
$(x, y) \in \mathbb{R}^{d+1}$ of the couple $(X, Y)$. This assumption fails if it is known a priori
that the realizations come from a population $\Omega$ which can be partitioned into $G$ disjoint
homogeneous groups $\Omega_g$, $g = 1, \dots, G$. In this case, a mixture of linear regressions
(or clusterwise regression) is a more appropriate statistical tool. According to their
degree of flexibility and generality, we can distinguish three types of mixtures of
regressions: mixtures of regressions with fixed covariates (MRFC) [3]; mixtures of
regressions with concomitant variables (MRCV) [6]; and mixtures of regressions
with random covariates (MRRC), also referred to in the literature as cluster-weighted
models [3, 4].

Mixtures of regressions can also be employed from a classification point of view
to identify the group membership of each observation. In this case, the generated
classifier divides the real space into $G$ regions through particular surfaces in
$\mathbb{R}^{d+1}$ called decision boundaries. In [5], the decision boundaries generated by each
type of mixture are analyzed from a geometrical point of view, especially in those cases
where $d = 1, 2$ and $G = 2$. The aim of the present work is to extend the results
presented in the aforementioned paper to a higher dimensional case where $d = 3$,
giving more insight into the properties of these classifiers. The rest of the paper is
organized as follows. In Section 2 we summarize the main ideas about mixtures of
regressions. In Section 3 decision boundaries are defined, and a geometrical
classification is then proposed in Section 4 for $d = 3$ and $G = 2$. In Section 5, we
conclude by investigating, with a practical example, the shape of three-dimensional
decision boundaries in the presence of variables following heavy-tailed $t$-distributions.


2 Mixtures of Regressions 
Below we briefly define three types of mixtures of regressions, ordered according to 
their generality and flexibility, given by an increasing number of parameters. 


MRFC. Mixtures of regressions with fixed covariates have the following density:

$$
p(y \mid x; \psi) = \sum_{g=1}^{G} \pi_g f(y \mid x; \theta_g). \tag{1}
$$

The density $f(y \mid x; \theta_g)$ is indexed by a parameter vector $\theta_g$ belonging to a Euclidean
parametric space $\Theta_g$. Moreover, every $\pi_g$ is positive and $\sum_{g=1}^{G} \pi_g = 1$. The vector
$\psi = (\pi_1, \dots, \pi_G, \theta_1, \dots, \theta_G)$ denotes the set of all the parameters of the model.


MRCV. The density of a mixture of regressions with concomitant variables is:

$$
p(y \mid x; \psi) = \sum_{g=1}^{G} f(y \mid x; \theta_g)\, p(\Omega_g \mid x; \alpha), \tag{2}
$$

where the vector $\psi = (\theta_1, \dots, \theta_G, \alpha)$ contains all parameters indexing the model.
More specifically, $p(\Omega_g \mid x; \alpha)$ is a function depending on $x$ according to a vector
of real parameters $\alpha$. Typically, the probability $p(\Omega_g \mid x; \alpha)$ is a multinomial logistic
function:

$$
p(\Omega_g \mid x; \alpha) = \frac{\exp(\alpha_{g0} + \alpha_{g1}^T x)}{\sum_{j=1}^{G} \exp(\alpha_{j0} + \alpha_{j1}^T x)}.
$$

Due to identifiability reasons, it is necessary to add the constraint $\alpha_1 = 0$, see [2].


MRRC. Mixtures of regressions with random covariates propose the following
decomposition for the joint density $p(x, y; \psi)$:

$$
p(x, y; \psi) = \sum_{g=1}^{G} f(y \mid x; \theta_g)\, p(x; \xi_g)\, \pi_g, \tag{3}
$$

where $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$. Furthermore, the model is totally parametrized
by the vector $\psi = (\pi_1, \dots, \pi_G, \theta_1, \dots, \theta_G, \xi_1, \dots, \xi_G)$, where each $\theta_g$ indexes the
conditional density $f(y \mid x; \theta_g)$, while each $\xi_g$ refers to the density of $X$ in the group
$\Omega_g$, denoted by $p(x; \xi_g)$.

In particular, under Gaussian assumptions we have $Y \mid x, \Omega_g \sim N(\beta_{g0} + \beta_{g1}^T x, \sigma_g^2)$,
where each $\beta_g = (\beta_{g0}, \beta_{g1})$ is a vector of real parameters. Only for the MRRC model, we
will further assume $X \mid \Omega_g \sim N(\mu_g, \Sigma_g)$ for all $g = 1, \dots, G$, where $\mu_g$ denotes the
mean of the Gaussian distribution, while $\Sigma_g$ is its covariance matrix. Denoting by
$\phi(\cdot)$ the Gaussian density function, equations (1)-(3) can be, respectively, rewritten
as

$$
p(y \mid x; \psi) = \sum_{g=1}^{G} \pi_g\, \phi(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2), \tag{4}
$$

$$
p(y \mid x; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2)\, p(\Omega_g \mid x; \alpha), \tag{5}
$$

$$
p(x, y; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2)\, \phi(x; \mu_g, \Sigma_g)\, \pi_g. \tag{6}
$$

Maximum likelihood estimates of $\psi$ are usually obtained with the
Expectation-Maximization (EM) algorithm. Then, the final estimate is used to build
classifiers which group observations into $G$ disjoint classes.


3 Decision Boundaries: Generality 


There are different ways to build classifiers. One of the best known is the method of
discriminant functions. The aim of this procedure is to define $G$ functions $D_g(x, y; \psi)$
and a decision rule to divide the real space $\mathbb{R}^{d+1}$ into $G$ decision regions, named
$R_1, \dots, R_G$. The decision regions have a one-to-one relationship with the subgroups
$\Omega_g$, i.e., if an observation $(x, y) \in \mathbb{R}^{d+1}$ is assigned to $R_g$, it is classified as part
of $\Omega_g$. Among all possible decision rules, the most used one consists in assigning
$(x, y)$ to $R_g$ if:

$$
D_g(x, y; \psi) > D_j(x, y; \psi) \quad \forall j \neq g. \tag{7}
$$

Then, decision boundaries are defined as the surfaces in $\mathbb{R}^{d+1}$ separating the
decision regions $R_g$, where observations cannot be uniquely classified. Formally,
each decision boundary is a hypersurface represented by the mathematical equation
$D_j(x, y; \psi) - D_k(x, y; \psi) = 0$, $j \neq k$.

Different choices for discriminant functions are possible: under Gaussian assumptions
it is convenient to define $D_g(\cdot)$ as the logarithm of the $g$-th component mixture
density, as it yields useful computational simplifications [5]. So, we can define, for
all the three models, these discriminant functions:

MRFC: $D_g(x, y; \psi) = \ln[\phi(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2)\, \pi_g]$ (8)
MRCV: $D_g(x, y; \psi) = \ln[\phi(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2) \exp(\alpha_{g0} + \alpha_{g1}^T x)]$ (9)
MRRC: $D_g(x, y; \psi) = \ln[\phi(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2)\, \phi(x; \mu_g, \Sigma_g)\, \pi_g]$ (10)


3.1 The Case with G = 2 


In the case of interest where $G = 2$, there is a single decision boundary defined by
the equation $D(x, y; \psi) = D_2(x, y; \psi) - D_1(x, y; \psi) = 0$. Thus, the assignment rule
for every point $(x, y) \in \mathbb{R}^{d+1}$ is based on the sign of $D(x, y; \psi)$: it assigns $(x, y)$ to
$\Omega_2$ if $D(x, y; \psi) > 0$, and to $\Omega_1$ otherwise.
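As a concrete illustration of the sign rule for $G = 2$, the sketch below evaluates the Gaussian MRFC discriminant (the logarithm of the mixing weight times the Gaussian regression density) for two groups and classifies a point. The function names and the parameter values in the usage example are hypothetical and chosen only for illustration.

```python
import math


def log_gaussian(y, mean, var):
    """Log-density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)


def mrfc_discriminant(x, y, pi_g, beta0, beta1, var):
    """D_g(x, y) = ln[ pi_g * phi(y; beta0 + beta1' x, var) ] for one group."""
    mean = beta0 + sum(b * xj for b, xj in zip(beta1, x))
    return math.log(pi_g) + log_gaussian(y, mean, var)


def classify_two_groups(x, y, group1, group2):
    """Assign (x, y) to group 2 if D = D_2 - D_1 > 0, and to group 1 otherwise."""
    d = mrfc_discriminant(x, y, **group2) - mrfc_discriminant(x, y, **group1)
    return 2 if d > 0 else 1


# Usage with illustrative (assumed) parameters and two covariates.
group1 = dict(pi_g=0.3, beta0=1.0, beta1=[2.0, -3.0], var=0.25)
group2 = dict(pi_g=0.7, beta0=1.0, beta1=[-4.0, 3.0], var=0.25)
label = classify_two_groups([1.0, 0.0], 3.0, group1, group2)
```

The point in the usage example lies exactly on the regression plane of the first group, so the first discriminant dominates there.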

In [5] the geometrical properties of the hypersurfaces defined by the equation
$D(x, y; \psi) = 0$ have been investigated up to dimension $d = 2$, providing the following
propositions for quadrics.


Proposition 1 (MRFC quadrics) The decision boundary between $\Omega_1$ and $\Omega_2$ is
always a degenerate quadric.


Proposition 2 (MRCV quadrics) If $\alpha_{21}^T (\beta_{21} - \beta_{11}) \neq 0$, then the decision boundary
between $\Omega_1$ and $\Omega_2$ is a paraboloid; otherwise it is a degenerate quadric.


Proposition 3 (MRRC quadrics) Under convenient conditions, the decision boundary
between $\Omega_1$ and $\Omega_2$ can be a degenerate quadric, but it can also assume any
of the general quadric forms.


These results show that models with more flexibility, i.e. with more parameters, 
can generate more varieties of decision boundaries. In the following section, we will 
extend these statements to dimension d = 3. 
4 Geometrical Classification of Decision Boundaries with G = 2
and d = 3


In this section we extend the previous results for mixtures of regressions in the presence
of two classes and $d = 3$, where decision boundaries turn out to be hyperquadrics in
$\mathbb{R}^4$. Mathematical proofs of the results for MRFC and MRCV models are based on an
algebraic analysis of the matrices representing these hyperquadrics.


MRFC. Mixtures of regressions with fixed covariates are characterized by a low 
degree of flexibility. Indeed, all decision boundaries are degenerate hyperquadrics 
as the following result shows. 


Proposition 4 (MRFC hyperquadrics) The decision boundary between $\Omega_1$ and $\Omega_2$
is a degenerate hyperquadric of rank at most equal to 3. The rank is less than 3 if
$\beta_{11} = \beta_{21}$ or $\pi_1 / \sigma_1 = \pi_2 / \sigma_2$.
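The rank statement for the MRFC case can be checked numerically. Writing $u = (x, y, 1)$ and $c_g = (-\beta_{g1}, 1, -\beta_{g0})$, the Gaussian discriminant difference $D_2 - D_1$ is the quadratic form $u^T A u$ with $A = c_1 c_1^T / (2\sigma_1^2) - c_2 c_2^T / (2\sigma_2^2) + k\, e e^T$, where $e$ is the last unit vector and $k = \ln(\pi_2 / \pi_1) - \frac{1}{2} \ln(\sigma_2^2 / \sigma_1^2)$. This decomposition is our own working from the Gaussian form of the discriminant, not a formula taken from the proofs, and the function names and parameter values below are hypothetical.

```python
import math


def matrix_rank(a, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    rows, cols = len(a), len(a[0])
    rank = 0
    for c in range(cols):
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][c] / a[rank][c]
            for j in range(c, cols):
                a[r][j] -= factor * a[rank][j]
        rank += 1
        if rank == rows:
            break
    return rank


def mrfc_quadric_matrix(pi1, beta10, beta11, var1, pi2, beta20, beta21, var2):
    """Symmetric matrix A such that D_2 - D_1 = u' A u with u = (x, y, 1)."""
    c1 = [-b for b in beta11] + [1.0, -beta10]
    c2 = [-b for b in beta21] + [1.0, -beta20]
    m = len(c1)
    k = math.log(pi2 / pi1) - 0.5 * math.log(var2 / var1)
    a = [[c1[i] * c1[j] / (2.0 * var1) - c2[i] * c2[j] / (2.0 * var2)
          for j in range(m)] for i in range(m)]
    a[m - 1][m - 1] += k  # constant term of the discriminant difference
    return a


# Usage: d = 3 covariates, so A is 5 x 5 but its rank cannot exceed 3.
A = mrfc_quadric_matrix(0.3, 1.0, [2.0, -3.0, 1.0], 0.25,
                        0.7, -1.0, [-4.0, 3.0, 0.5], 1.0)
rank_generic = matrix_rank(A)
```

Because $A$ is a combination of three rank-one terms, its rank is at most 3 whatever the dimension of $x$, and it drops further when the slope vectors coincide or when the constant $k$ vanishes.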


MRCV. An MRCV allows more degrees of freedom than an MRFC. A consequence is
that the obtained decision boundaries are higher-rank hyperquadrics, as the following
result states.


Proposition 5 (MRCV hyperquadrics) The decision boundary between $\Omega_1$ and $\Omega_2$
is a degenerate hyperquadric with rank at most equal to 4. In particular, the rank is
equal to 4 if $\alpha_{21}^T (\beta_{21} - \beta_{11}) \neq 0$. In addition, if $\alpha_{21}^T (\beta_{21} - \beta_{11}) = 0$ and c? = 0, the
matrix has rank at most equal to 2, therefore the hyperquadric is reducible.


MRRC. Proposition 3 shows that MRRCs exhibit a high number of possible types of
conics and quadrics [5]. This fact is confirmed in dimension $d = 3$, even if a strong
theoretical result is difficult to obtain with simple algebra due to the mathemati- 
cal complexity of the MRRC hyperquadric matrix. Indeed, it is possible to show 
such flexibility by building several practical examples (not displayed here), where 
hyperquadrics of various shapes arise. 


Analyzing the provided results, we can note that they match the hierarchy
established in dimension $d = 2$. Indeed, an MRFC can generate only degenerate
hyperquadrics of rank at most 3; the surfaces generated by an MRCV, which has more
parameters, are still degenerate, but with a higher rank (equal to 4) depending on the same
mathematical condition as in Proposition 2; finally an MRRC, the most flexible model in
terms of number of parameters, can give rise to various hyperquadrics, as in $d = 2$.


5 Beyond Gaussian Assumptions: t-distribution in d = 2


In [5], the Gaussian assumptions were relaxed by illustrating the case of a simple linear
regression ($G = 2$ and $d = 1$) where more general $t$-distributions were required
for robustness reasons. It is shown that the generated decision boundaries are more
flexible than their Gaussian counterparts, as they can assume a wider variety of shapes,
although these surfaces can be calculated only numerically. In this section, we
continue the exploration of the $t$-distribution case by adding one more variable, thus
$d = 2$. Under these more general assumptions, discriminant functions (8)-(10)
become:

MRFC-t: $D_g(x, y; \psi) = \ln[q(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2, \eta_g)\, \pi_g]$, (11)
MRCV-t: $D_g(x, y; \psi) = \ln[q(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2, \eta_g) \exp(\alpha_{g0} + \alpha_{g1}^T x)]$, (12)
MRRC-t: $D_g(x, y; \psi) = \ln[q(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2, \eta_g)\, q(x; \mu_g, \Sigma_g, \nu_g)\, \pi_g]$, (13)

where $q(y; \beta_{g0} + \beta_{g1}^T x, \sigma_g^2, \eta_g)$ denotes a generalized $t$-distribution density, with
non-centrality parameter equal to $\beta_{g0} + \beta_{g1}^T x$, scaling parameter equal to $\sigma_g^2$ and degrees
of freedom given by $\eta_g$. Similarly, $q(x; \mu_g, \Sigma_g, \nu_g)$ is a multivariate generalized
$t$-distribution density, where $\mu_g$ is the non-centrality parameter, $\Sigma_g$ denotes the
scaling and $\nu_g$ represents the degrees of freedom. Figures 1 and 2 display the decision
boundaries for the three considered models, whose parameters are presented in Table
1: they clearly show the gain in flexibility given by the more general distributional
assumptions. Moreover, $t$-boundaries with $\eta_1 = \eta_2 = 10$ (Figure 2; red curves) seem
to be closer to the Gaussian ones (blue curves) than those with $\eta_1 = \eta_2 = 3$ (Figure 1;
orange curves): this is consistent with standard probability theory, since the
$t$-distribution approaches the Gaussian as the degrees of freedom grow.
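The effect of the degrees of freedom can be seen directly on the univariate density. The sketch below implements the log-density of a location-scale Student $t$ distribution, which we take as the meaning of the "generalized $t$-distribution" for the response (an assumption), and plugs it into an MRFC-type discriminant. Function names and parameter values are hypothetical.

```python
import math


def log_t_density(y, loc, scale2, dof):
    """Log-density of a location-scale Student t with squared scale `scale2`."""
    z2 = (y - loc) ** 2 / scale2
    return (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
            - 0.5 * math.log(dof * math.pi * scale2)
            - 0.5 * (dof + 1.0) * math.log1p(z2 / dof))


def log_gaussian_density(y, mean, var):
    """Log-density of a univariate Gaussian, used here only for comparison."""
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)


def mrfc_t_discriminant(x, y, pi_g, beta0, beta1, scale2, dof):
    """D_g(x, y) = ln[ q(y; beta0 + beta1' x, scale2, dof) * pi_g ]."""
    loc = beta0 + sum(b * xj for b, xj in zip(beta1, x))
    return math.log(pi_g) + log_t_density(y, loc, scale2, dof)


# Far from the centre, fewer degrees of freedom give heavier tails.
tail_3 = log_t_density(6.0, 0.0, 1.0, 3)
tail_10 = log_t_density(6.0, 0.0, 1.0, 10)
tail_gauss = log_gaussian_density(6.0, 0.0, 1.0)
```

Comparing the three tail values shows why boundaries computed with 10 degrees of freedom sit closer to the Gaussian ones than those computed with 3.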


Table 1 Parameters used in Figures 1 and 2. MRRC: the covariance matrices $\Sigma_1$ and $\Sigma_2$ are
equal to the identity matrix $I_2$.


Model  Group  $\pi_g$  $\beta_{g0}$  $\beta_{g1}$  $\sigma_g$  $\alpha_{g0}$  $\alpha_{g1}$  $\mu_g$   $\nu_g$
MRFC   1      0.3      1             (2,-3)        0.5
       2      0.7      1             (4,3)         0.5
MRCV   1      0.3      1             (2,-3)        0.5
       2      0.7      1             (4,3)         0.5         1              (1,0.5)
MRRC   1      0.3      1             (2,-3)        0.5                                       (1,2)     5
       2      0.7      1             (-4,3)        0.5         1              (1,0.5)        (-1,-2)   5


Fig. 1 Decision boundaries under assumptions of Gaussian (in blue) and $t$-distributed variables
with $\eta_1 = \eta_2 = 3$ (in orange) for the three considered mixtures of regressions. Panels:
(a) MRFC, (b) MRCV, (c) MRRC.


Fig. 2 Decision boundaries under assumptions of Gaussian (in blue) and $t$-distributed variables
with $\eta_1 = \eta_2 = 10$ (in red) for the three considered mixtures of regressions. Panels:
(a) MRFC, (b) MRCV, (c) MRRC.


6 Conclusions


This work has provided a trivariate geometrical classification for the
decision boundaries generated by mixtures of regressions in the presence of two classes.
Under Gaussian assumptions, our results confirmed the same hierarchy that was
shown in $d = 2$, as MRRC turns out to exhibit a wide variety of decision boundaries,
while the other models generate only degenerate surfaces. This is consistent with its high
degree of flexibility given by its very general parametrization. The provided results
could help to select the right model depending on the shape of the data. For example,
if in a descriptive analysis the data turn out to be approximately separated by a simple
degenerate hyperquadric, it will be better to estimate an MRFC or an MRCV instead
of a complex MRRC. On the contrary, if the separation surface seems to be
non-degenerate, then it will be preferable to fit a general MRRC. Moreover, this work
also showed that the degree of flexibility (thus, the variety of possible decision
boundaries) can be enhanced by going beyond Gaussianity, assuming, for example,
$t$-distributed variables. This encourages additional extensions where more general
distributions can be included, allowing a better comprehension of mixtures and
possible applications to generalized linear models where categorical variables are
considered.
Generalized Spatio-temporal Regression with
PDE Penalization


Eleonora Arnone, Elia Cunial, and Laura M. Sangalli 


Abstract We develop a novel generalised linear model for the analysis of data dis- 
tributed over space and time. The model involves a nonparametric term f, a smooth 
function over space and time. The estimation is carried out by the minimization of an 
appropriate penalized negative log-likelihood functional, with a roughness penalty 
on f that involves space and time differential operators, in a separable fashion, or 
an evolution partial differential equation. The model can include covariate informa- 
tion in a semi-parametric setting. The functional is discretized by means of finite 
elements in space, and B-splines or finite differences in time. Thanks to the use of 
finite elements, the proposed method is able to efficiently model data sampled over 
irregularly shaped spatial domains, with complicated boundaries. To illustrate the 
proposed model we present an application to study the criminality in the city of 
Portland, from 2015 to 2020. 


Keywords: functional data analysis, spatial data analysis, semiparametric regression 
with roughness penalty 
1 Introduction


In this work we develop a novel generalised linear model for the analysis of data
distributed over space and time. Let $Y$ be a real-valued variable of interest, and $W$ a
vector of $q$ covariates, observed at $n$ spatio-temporal locations
$\{(p_i, t_i)\}_{i=1,\dots,n} \subset \Omega \times T$, where $\Omega \subset \mathbb{R}^2$ is a bounded spatial domain, and
$T \subset \mathbb{R}$ a temporal interval. We assume that the expected value of $Y$, conditional on
the covariates and the location of observation, can be modeled as:

$$
g(E[Y \mid W, p, t]) = W^T \beta + f(p, t),
$$

where $g$ is a known monotone link function, chosen on the basis of the stochastic
nature of $Y$, $\beta \in \mathbb{R}^q$ is an unknown vector of regression coefficients, and
$f : \Omega \times T \to \mathbb{R}$ is an unknown deterministic function, which captures the
spatio-temporal variation of the phenomenon under study. Starting from the values
$\{(y_i, w_i)\}_{i=1,\dots,n}$ of the observed response variable and covariates, we estimate $\beta$ and
$f$ in a semiparametric fashion. In particular, following the approach in [9], which
considers a similar problem for data scattered over space only, we minimize the functional


$$
\ell(\beta, f) + P(f),
$$

where $\ell$ is the appropriate negative log-likelihood, and $P(f)$ is a penalty that forces
$f$ to be a regular function.

Similarly to the regression methods in [1, 2, 3, 4, 5, 7, 8], the roughness penalty
on $f$, $P(f)$, involves some partial differential operators. In particular, our aim is
to extend the Spatial-Temporal regression with partial differential equations
regularization (ST-PDE), developed in [2, 3, 4], to generalized linear model settings,
further broadening the class of regression models with PDE regularization reviewed
in [6]. Hence, as in ST-PDE, the proposed generalized linear model has a roughness
penalty that involves a second-order linear differential operator $L$ applied to $f$.
Specifically, as in [4], we may consider the penalty

$$
P(f) = \lambda_T \int_\Omega \int_T \left( \frac{\partial^2 f}{\partial t^2} \right)^2 + \lambda_S \int_T \int_\Omega (L f)^2,
$$

where the first term accounts for the regularity of the function in time, while the
second accounts for the regularity of the function in space; the importance of each
term is controlled by two smoothing parameters $\lambda_T$ and $\lambda_S$. Alternatively, as in
[2], we may consider a single penalty which accounts for the spatial and temporal
regularity:

$$
P(f) = \lambda \int_T \int_\Omega \left( \frac{\partial f}{\partial t} + L f - u \right)^2.
$$

Unlike the models in [2, 3, 4], the estimation functional to be minimized
is not quadratic. This poses increased difficulties from the computational point
of view. The minimization is performed via a functional version of the penalized
iteratively reweighted least squares algorithm.
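To give an idea of how penalized iteratively reweighted least squares works, the sketch below solves a toy, fully discrete analogue: Poisson counts with log link, one unknown value per time point, and a second-difference roughness penalty. It is not the functional algorithm of the paper (there are no finite elements, B-splines or covariates here); the penalty matrix, function names and data are assumptions made for illustration. The objective is $\sum_i (e^{f_i} - y_i f_i) + \lambda f^T P f$.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= factor * m[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def second_difference_penalty(n):
    """P = D'D, with D the (n - 2) x n second-difference matrix."""
    p = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        stencil = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, di in stencil.items():
            for j, dj in stencil.items():
                p[i][j] += di * dj
    return p


def pirls_poisson(y, lam, n_iter=25):
    """Penalized IRLS for Poisson counts with log link (toy discrete analogue)."""
    n = len(y)
    pen = second_difference_penalty(n)
    f = [math.log(yi + 0.5) for yi in y]  # starting value
    for _ in range(n_iter):
        mu = [math.exp(fi) for fi in f]   # current means, also the IRLS weights
        z = [fi + (yi - mi) / mi for fi, yi, mi in zip(f, y, mu)]  # working response
        lhs = [[2.0 * lam * pen[i][j] + (mu[i] if i == j else 0.0)
                for j in range(n)] for i in range(n)]
        rhs = [mi * zi for mi, zi in zip(mu, z)]
        f = solve(lhs, rhs)               # weighted penalized least squares step
    return f


def penalized_gradient(f, y, lam):
    """Gradient of sum(exp(f_i) - y_i f_i) + lam * f'Pf, to check convergence."""
    pen = second_difference_penalty(len(f))
    return [math.exp(fi) - yi + 2.0 * lam * sum(pij * fj for pij, fj in zip(row, f))
            for fi, yi, row in zip(f, y, pen)]
```

Each iteration is a Newton step written as a weighted, penalized least squares problem, which is why a non-quadratic functional can still be minimized with linear solves.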

The estimation problem is appropriately discretized. In particular, in time, the
discretization involves either cubic B-splines, for the two-penalties case, or finite
differences, when the single penalty is employed. The discretization in space is
performed via finite elements, on a triangulation of the spatial domain of interest.
This makes it possible to appropriately handle spatial domains with complicated
boundaries, such as the one considered in the following section, concerning the study of
criminality data over the city of Portland.


2 Application to Criminality Data 


This section describes the Portland criminality data, which will be used to illustrate
the proposed methodology. We will present a Poisson model to count the crimes in
the city, and study their evolution from April 2015 to November 2020. In addition,
we shall consider as a covariate the population of the city neighborhoods. The crime
data are publicly available on the website of the Police Bureau of the city¹.

The crime counts are aggregated by trimester and at the neighborhood level.
Figure 1 shows the city neighborhoods, each neighborhood colored according to its
total population. The bottom part of the same figure shows the temporal evolution
of the crimes in each neighborhood. Each curve corresponds to a neighborhood
and is colored according to the neighborhood population. In both panels, the three
neighborhoods with the highest number of crimes are indicated by numbers 1, 2
and 3. The figure highlights the presence of some correlation between neighborhood
population and the number of crimes. However, criminality is not fully explained by
population. For instance, neighborhoods 1 and 3 present a high number of crimes
with a moderate population. This motivates the use of a semiparametric
generalized linear model, such as the one introduced in Section 1, with a nonparametric
term accounting for the spatio-temporal variability in the phenomenon that cannot
be explained by population or other census quantities. Figure 2 shows the same data
for four different trimesters on the Portland map. As already pointed out, the three
areas with the highest number of crimes are in the city center and in the Hazelwood
neighborhood, in the east part of the city.

From Figures 1 and 2 we can see that the shape of the domain is complicated; the
city is indeed crossed by a river, with few bridges connecting the two parts, most of
them placed downtown. Therefore, neighborhoods on opposite sides of the river and
far from the center, where most bridges are located, are close in Euclidean distance
but far apart in reality. This particular morphology influences the phenomenon under
study; for example, in the north of the city, the east side of the river is characterized by
a higher number of crimes than the west side. Due to these characteristics
of the data and the domain, it is of crucial importance to take into account the shape
of the domain during the estimation process. For this reason, estimation based on
classical semiparametric models, such as those based on thin-plate splines, would
give poor results, while the proposed method is particularly well suited, being able
to comply with the nontrivial shape of the domain.


¹ Police Bureau crime data: https://www.portlandoregon.gov/police/71978


Fig. 1 Top: the city of Portland divided into neighborhoods, each neighborhood colored according
to the total population. Bottom: the total crimes over time for each neighborhood; each curve
corresponds to a neighborhood and is colored according to the neighborhood's population. The
three neighborhoods with the highest number of crimes are indicated by numbers 1, 2 and 3.


Fig. 2 Total crime counts per neighborhood per trimester; green indicates a lower number of crimes,
red indicates a higher number of crimes.
A New Regression Model for the Analysis of
Microbiome Data


Roberto Ascari and Sonia Migliorati 


Abstract Human microbiome data are becoming extremely common in biomed- 
ical research due to the relevant connections with different types of diseases. 
A widespread discrete distribution to analyze this kind of data is the Dirichlet- 
multinomial. Despite its popularity, this distribution often fails in modeling micro- 
biome data due to the strict parameterization imposed on its covariance matrix. The 
aim of this work is to propose a new distribution for analyzing microbiome data 
and to define a regression model based on it. The new distribution can be expressed 
as a structured finite mixture model with Dirichlet-multinomial components. We 
illustrate how this mixture structure can improve a microbiome data analysis to 
cluster patients into "enterotypes", which are a classification based on the bacterio- 
logical composition of gut microbiota. The comparison between the two models is 
performed through an application to a real gut microbiome dataset. 


Keywords: count data, Bayesian inference, mixture model, multivariate regression 


1 Introduction 


The human microbiome is defined as the set of genes associated with the microbiota,
i.e. the microbial community living in the human body, including bacteria,
viruses and some unicellular eukaryotes [1, 8]. The mutualistic relationship between
microbiota and human beings is often beneficial, though it can sometimes
become detrimental for several health outcomes. For example, changes in the gut
microbiome composition can be associated with diabetes, cardiovascular disease,
obesity, autoimmune disease, anxiety and many other factors impacting on human
health [1, 5, 12, 14]. Moreover, the development of next-generation sequencing
technologies nowadays makes it possible to survey the microbiome composition using
direct DNA sequencing of either marker genes or the whole metagenome, without the
need for isolation and culturing. These are the two main reasons for the recent
explosion of research on the microbiome, and they highlight the importance of
understanding the association between microbiome composition and biological and
environmental covariates.

A widely used distribution for handling microbiome data is the Dirichlet-multinomial
(DM) (e.g., see [4, 16]), a generalization of the multinomial distribution
obtained by assuming that, instead of being fixed, the underlying taxa proportions
come from a Dirichlet distribution. This makes it possible to model overdispersed
count data, that is, data showing a variance much larger than that predicted by the
multinomial model. Despite its popularity, the DM distribution is often inadequate to
model real microbiome datasets due to the strict covariance structure imposed by its
parameterization, which hinders the description of co-occurrence and co-exclusion
relationships between microbial taxa.

The aim of this work is to propose a new distribution that generalizes the DM,
namely the flexible Dirichlet-multinomial (FDM), and a regression model based on
it. The new model provides a better fit to real microbiome data, while preserving a
clear interpretation of its parameters. Moreover, being a finite mixture with DM
components, it makes it possible to account for the latent group structure of the data,
and thus to identify clusters sharing similar biota compositions.


2 Statistical Models for Microbiome Data 


In this section, we define a new distribution for multivariate counts and a regression
model based on it, which makes it possible to link microbiome abundances with covariates.
Note that, once the DNA sequence reads have been aligned to the reference microbial
genomes, the abundances of microbial taxa can be quantified. Thus, microbiome data
represent the count composition of $D$ bacterial taxa in a specific biological sample,
and a microbiome dataset is a sequence of $D$-dimensional vectors $Y_1, Y_2, \ldots, Y_N$,
where $Y_{ir}$ counts the number of occurrences of taxon $r$ in the $i$-th sample
($i = 1, \ldots, N$ and $r = 1, \ldots, D$). Since the $i$-th sample contains a number
$n_i$ of bacteria, microbiome observations are subject to a fixed-sum constraint, that is
$\sum_{r=1}^{D} Y_{ir} = n_i$.


2.1 Count Distributions


Following a compound approach, we assume that
$Y \mid \Pi = \pi \sim \text{Multinomial}(n, \pi)$,
and we consider suitable distributions for the vector of probabilities
$\Pi \in \mathcal{S}^D$. The set
$\mathcal{S}^D = \{\pi = (\pi_1, \ldots, \pi_D)^\top : \pi_r > 0, \sum_{r=1}^{D} \pi_r = 1\}$
is the $D$-part simplex and it is the proper support of continuous compositional
vectors. A distribution for $Y$ is obtained by marginalizing the joint distribution of
$(Y, \Pi)^\top$. A common choice for the distribution of $\Pi$ is the mean-precision
parameterized Dirichlet, whose probability density function (p.d.f.) is

$$f_{Dir}(\pi; \mu, \alpha^+) = \frac{\Gamma(\alpha^+)}{\prod_{r=1}^{D} \Gamma(\alpha^+ \mu_r)} \prod_{r=1}^{D} \pi_r^{\alpha^+ \mu_r - 1},$$

where $\mu = E[\Pi] \in \mathcal{S}^D$, and $\alpha^+ > 0$ is a precision parameter.
Compounding the multinomial distribution with the Dirichlet one leads to the DM
distribution, widely used in microbiome data analysis, whose probability mass function
(p.m.f.) is

$$f_{DM}(y; n, \mu, \alpha^+) = \frac{n!\, \Gamma(\alpha^+)}{\Gamma(\alpha^+ + n)} \prod_{r=1}^{D} \frac{\Gamma(\alpha^+ \mu_r + y_r)}{y_r!\, \Gamma(\alpha^+ \mu_r)}.$$

The mean vector of a DM distribution is $E[Y] = n\mu$, so that the parameter
$\mu = E[Y]/n$ can be thought of as a scaled mean vector. Moreover, its covariance matrix
is

$$V[Y] = n M \left[1 + \frac{n - 1}{\alpha^+ + 1}\right], \qquad (1)$$

where $M = \text{Diag}(\mu) - \mu\mu^\top$. Equation (1) highlights how the additional
parameter $\alpha^+$ increases the flexibility of the variability structure with respect
to the standard multinomial distribution.
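
As a numerical illustration of the DM mean and of Equation (1), the following minimal sketch evaluates the DM p.m.f. on the full support for a small example and compares the empirical variance of the first count with the first diagonal entry of the covariance formula. The function names and the values of $n$, $\mu$ and $\alpha^+$ are assumptions chosen only for the example.

```python
from itertools import product
from math import exp, lgamma


def dm_logpmf(y, mu, alpha_plus):
    """Log p.m.f. of the mean-precision Dirichlet-multinomial."""
    n = sum(y)
    out = lgamma(n + 1) + lgamma(alpha_plus) - lgamma(alpha_plus + n)
    for y_r, mu_r in zip(y, mu):
        a_r = alpha_plus * mu_r
        out += lgamma(a_r + y_r) - lgamma(a_r) - lgamma(y_r + 1)
    return out


def compositions(n, d):
    """All count vectors of length d whose entries sum to n."""
    return [y for y in product(range(n + 1), repeat=d) if sum(y) == n]


# Illustrative values only (not taken from the paper).
n, mu, alpha_plus = 6, (0.5, 0.3, 0.2), 2.0
support = compositions(n, len(mu))
probs = [exp(dm_logpmf(y, mu, alpha_plus)) for y in support]

total = sum(probs)
mean_1 = sum(pr * y[0] for pr, y in zip(probs, support))
var_1 = sum(pr * (y[0] - mean_1) ** 2 for pr, y in zip(probs, support))
# First diagonal entry of n * M * [1 + (n - 1) / (alpha_plus + 1)].
var_1_formula = n * mu[0] * (1 - mu[0]) * (1 + (n - 1) / (alpha_plus + 1))
```

The factor $1 + (n - 1)/(\alpha^+ + 1)$ is the inflation of the multinomial variance, so small values of $\alpha^+$ correspond to strong overdispersion.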

We propose to take advantage of an alternative sound distribution defined on
$\mathcal{S}^D$, namely the flexible Dirichlet (FD) [7, 9]. The latter is a structured
finite mixture with Dirichlet components, entailing some constraints among the
components' parameters to ensure model identifiability. Thanks to its mixture structure,
the p.d.f. of an FD-distributed random vector can be expressed as

$$f_{FD}(\pi; \mu, \alpha^+, p, w) = \sum_{j=1}^{D} p_j\, f_{Dir}\left(\pi; \lambda_j, \frac{\alpha^+}{1 - w}\right), \qquad (2)$$

where

$$\lambda_j = \mu - w p + w e_j \qquad (3)$$

is the mean vector of the $j$-th component, $\mu = E[\Pi] \in \mathcal{S}^D$,
$\alpha^+ > 0$, $p \in \mathcal{S}^D$,
$0 < w < \min\{1, \min_{r \in \{1, \ldots, D\}} \mu_r / p_r\}$, and $e_j$ is a vector
with all elements equal to zero except for the $j$-th, which is equal to one.

Equation (2) shows that the Dirichlet components have different mean vectors
and a common precision parameter, the latter being determined by $\alpha^+$ and $w$. In
particular, inspecting Equation (3), it is easy to observe that any two vectors
$\lambda_r$ and $\lambda_h$, $r \neq h$, coincide in all the elements except for the
$r$-th and the $h$-th.
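
The structure of the component means can be checked directly. The sketch below builds the vectors $\lambda_j = \mu - wp + we_j$ for one illustrative choice of $\mu$, $p$ and $w$ (all assumed values, as is the function name `component_means`) and computes their $p$-weighted average, which should return $\mu$.

```python
def component_means(mu, p, w):
    """Return the list of lambda_j = mu - w*p + w*e_j for j = 1, ..., D."""
    d = len(mu)
    upper = min(1.0, min(m / q for m, q in zip(mu, p)))
    if not 0.0 < w < upper:
        raise ValueError(f"w must lie in (0, {upper})")
    return [
        [mu[r] - w * p[r] + (w if r == j else 0.0) for r in range(d)]
        for j in range(d)
    ]


# Illustrative values only (not taken from the paper).
mu, p, w = (0.5, 0.3, 0.2), (0.6, 0.35, 0.05), 0.5
lambdas = component_means(mu, p, w)
# Weighted average of the component means with weights p.
mixture_mean = [sum(p[j] * lambdas[j][r] for j in range(3)) for r in range(3)]
```

The upper bound on $w$ is exactly what keeps every entry $\mu_r - w p_r$ positive, so that each $\lambda_j$ stays inside the simplex.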

If $\Pi$ is assumed to be FD distributed, a new discrete distribution for count vectors
can be defined, which we call the flexible Dirichlet-multinomial (FDM). The p.m.f. of
the FDM can be expressed as

$$f_{FDM}(y; n, \mu, \alpha^+, p, w) = \sum_{j=1}^{D} p_j\, f_{DM}\left(y; n, \lambda_j, \frac{\alpha^+}{1 - w}\right), \qquad (4)$$

where $\lambda_j$ is defined in Equation (3). Interestingly, it is possible to recognize
the flexible beta-binomial (FBB) [3] distribution as a special case of the FDM. The
FBB is a generalization of the binomial distribution that is successful in dealing with
overdispersion. Moreover, note that when $p = \mu$ and $w = 1/(\alpha^+ + 1)$ the DM
distribution is recovered.
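
The special case mentioned above can be verified numerically. The following sketch, under the assumption that the FDM is the $p$-weighted mixture of DM components with scaled means $\lambda_j$ and common precision $\alpha^+/(1 - w)$, evaluates both p.m.f.s at one count vector; the function names and numerical values are chosen only for illustration.

```python
from math import exp, lgamma


def dm_pmf(y, mean, precision):
    """P.m.f. of a Dirichlet-multinomial with scaled mean `mean` and given precision."""
    n = sum(y)
    out = lgamma(n + 1) + lgamma(precision) - lgamma(precision + n)
    for y_r, m_r in zip(y, mean):
        out += lgamma(precision * m_r + y_r) - lgamma(precision * m_r) - lgamma(y_r + 1)
    return exp(out)


def fdm_pmf(y, mu, alpha_plus, p, w):
    """Mixture of D Dirichlet-multinomial components with weights p."""
    d = len(mu)
    phi = alpha_plus / (1.0 - w)  # common precision of the components
    total = 0.0
    for j in range(d):
        lam_j = [mu[r] - w * p[r] + (w if r == j else 0.0) for r in range(d)]
        total += p[j] * dm_pmf(y, lam_j, phi)
    return total


# Illustrative values only (not taken from the paper).
mu, alpha_plus, y = (0.5, 0.3, 0.2), 4.0, (3, 5, 2)
dm_value = dm_pmf(y, mu, alpha_plus)
fdm_value = fdm_pmf(y, mu, alpha_plus, p=mu, w=1.0 / (alpha_plus + 1.0))
```

With $p = \mu$ and $w = 1/(\alpha^+ + 1)$ the two values agree up to floating-point error; any other admissible choice of $p$ and $w$ generally gives a different p.m.f.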

Equation (4) shows that the FDM is a finite mixture with DM components
displaying a common precision parameter and different scaled mean vectors $\lambda_j$,
$j = 1, \ldots, D$. The overall mean vector and the covariance matrix of the FDM can
be expressed as

$$E[Y] = n\mu,$$

$$V[Y] = n M \left[1 + \frac{n - 1}{\phi + 1}\right] + n(n - 1)\, \frac{\phi\, w^2}{\phi + 1}\, P, \qquad (5)$$

where $M = \text{Diag}(\mu) - \mu\mu^\top$, $P = \text{Diag}(p) - pp^\top$, and
$\phi = \alpha^+/(1 - w)$ is the common precision parameter of the DM components. A
comparison between Equations (5) and (1) points out that the covariance matrix of the
FDM distribution is an easily interpretable extension of the DM's covariance matrix.
Indeed, it is composed of two terms, the first one coinciding with the DM's covariance
matrix, whereas the second one depends on the mixture structure of the FDM model. In
particular, the FDM covariance matrix has $D$ additional parameters with respect to
the DM, namely $D - 1$ distinct elements in the vector of mixing weights $p$, and the
parameter $w$, which controls the distance among the components' barycenters [7].
This is the key element explaining the greater ability of the FDM to model a wide
range of scenarios.
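
The two-term structure of the covariance can be traced back to the compound representation through the law of total covariance. The derivation below is a sketch that uses only the definitions of $\lambda_j$, $M$, $P$ and $\phi$ given in the text; the arrangement of the terms is one possible form and may differ from the authors' own presentation.

```latex
\begin{align*}
V[Y] &= E\big[V[Y \mid \Pi]\big] + V\big[E[Y \mid \Pi]\big]
      = n\big(M - V[\Pi]\big) + n^2\, V[\Pi]
      = nM + n(n-1)\, V[\Pi], \\
V[\Pi] &= \sum_{j=1}^{D} p_j\, \frac{\mathrm{Diag}(\lambda_j) - \lambda_j \lambda_j^\top}{\phi + 1}
        + \sum_{j=1}^{D} p_j\, (\lambda_j - \mu)(\lambda_j - \mu)^\top
        = \frac{M - w^2 P}{\phi + 1} + w^2 P, \\
V[Y] &= nM \left[1 + \frac{n-1}{\phi + 1}\right]
        + n(n-1)\, \frac{\phi\, w^2}{\phi + 1}\, P .
\end{align*}
```

The second line uses $\lambda_j - \mu = w(e_j - p)$, so that $\sum_j p_j (\lambda_j - \mu)(\lambda_j - \mu)^\top = w^2 P$. Setting $p = \mu$ and $w = 1/(\alpha^+ + 1)$ gives $P = M$ and $\phi = \alpha^+ + 1$, and the expression collapses to $nM[1 + (n - 1)/(\alpha^+ + 1)]$, the covariance of the DM.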


2.2 Regression Models


With the aim of performing a regression analysis, let $Y = (Y_1, \ldots, Y_N)^\top$ be
a set of independent multivariate responses collected on a sample of $N$ subjects/units.
For the $i$-th subject, $Y_i$ counts the number of times that each of $D$ possible taxa
occurred among $n_i$ trials, and $x_i$ is a $(K + 1)$-dimensional vector of covariates.

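A parameterization of the FDM useful in a regression perspective is the one based
on $\mu$, $p$, $\alpha^+$, and $\tilde{w}$, where

$$\tilde{w} = \frac{w}{\min\{1, \min_{r} \mu_r / p_r\}} \in (0, 1). \qquad (6)$$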

We can define the FDM regression (FDMReg) and the DM regression (DMReg)
models assuming that $Y_i$ follows an $\text{FDM}(n_i, \mu_i, \alpha^+, p, \tilde{w})$
or a $\text{DM}(n_i, \mu_i, \alpha^+)$ distribution, respectively. Even if the FDM and
DM distributions do not belong to the dispersion-exponential family, we can follow a
GLM-type approach [6] by linking the parameter $\mu_i$ to the linear predictor through a
proper link function such as the multinomial logit link function, that is

$$g(\mu_{ir}) = \log\left(\frac{\mu_{ir}}{\mu_{iD}}\right) = x_i^\top \beta_r, \qquad r = 1, \ldots, D - 1, \qquad (7)$$

where $\beta_r = (\beta_{r0}, \beta_{r1}, \ldots, \beta_{rK})^\top$ is a vector of
regression coefficients for the $r$-th element of $\mu_i$. Note that the last category
has been conventionally chosen as baseline category, thus $\beta_D = 0$.
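
Inverting the multinomial logit link gives the scaled mean vector as a softmax of the linear predictors, with the baseline predictor fixed at zero. The sketch below illustrates this for $D = 3$ and $K = 2$; the function name and the covariate and coefficient values are hypothetical.

```python
from math import exp, log


def inverse_multinomial_logit(x, betas):
    """Map a covariate vector x (leading 1 for the intercept) to mu_i.

    `betas` holds beta_1, ..., beta_{D-1}; the baseline beta_D is the zero vector.
    """
    etas = [sum(b * v for b, v in zip(beta, x)) for beta in betas] + [0.0]
    shift = max(etas)  # subtracting the maximum guards against overflow
    weights = [exp(e - shift) for e in etas]
    total = sum(weights)
    return [wgt / total for wgt in weights]


# Hypothetical covariates and coefficients for D = 3 taxa and K = 2 covariates.
x_i = [1.0, 0.4, -1.2]
betas = [[2.0, -0.3, 0.1], [-1.0, 0.6, 0.5]]
mu_i = inverse_multinomial_logit(x_i, betas)
log_ratio_1 = log(mu_i[0] / mu_i[-1])  # equals x_i' beta_1 = 1.76
```

Each coefficient is therefore read as a change in the log-ratio between the mean abundance of taxon $r$ and that of the baseline taxon.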

The parameterization of the FDMReg based on $\mu$, $p$, $\alpha^+$, and $\tilde{w}$
defines a variation independent parameter space, meaning that no constraints exist among
parameters. In a Bayesian framework, this makes it possible to assume prior independence
and, consequently, to specify a prior distribution for each parameter separately. In
order to induce minimum impact on the posterior distribution, we select
weakly-informative priors: (i) $\beta_r \sim N_{K+1}(0, \Sigma)$, where $0$ is the
$(K + 1)$-vector with zero elements, and $\Sigma$ is a diagonal matrix with 'large'
variance values, (ii) $\alpha^+ \sim \text{Gamma}(g_1, g_2)$ for small values of $g_1$
and $g_2$, (iii) $\tilde{w} \sim \text{Unif}(0, 1)$, and (iv) a uniform prior on the
simplex for $p$.

Inferential issues are dealt with by a Bayesian approach through a Hamiltonian
Monte Carlo (HMC) algorithm [10], which is a popular generalization of the
Metropolis-Hastings algorithm. The Stan modeling language [13] makes it possible to
implement an HMC method to obtain a simulated sample from the posterior distribution.

To compare the fit of the models we use the Watanabe-Akaike information criterion
(WAIC) [15, 17], a fully Bayesian criterion that balances goodness of fit against
model complexity: lower values of WAIC indicate a better fit.
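
For readers unfamiliar with the criterion, the following sketch computes WAIC from a matrix of pointwise log-likelihood values, one row per posterior draw. It assumes the common variant in which the complexity penalty is the sum over observations of the posterior sample variance of the log-likelihood; the text does not state which variant was used, and the function name and input values are illustrative.

```python
from math import exp, log
from statistics import mean, variance


def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is the log-likelihood of observation i under posterior draw s.
    """
    n_obs = len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        top = max(column)  # shift for a stable log-mean-exp
        lppd += top + log(mean(exp(v - top) for v in column))
        p_waic += variance(column)  # sample variance across draws
    return -2.0 * (lppd - p_waic)


# Three hypothetical posterior draws for two observations.
draws = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.3]]
example_waic = waic(draws)
```

When all draws give the same log-likelihood the penalty vanishes and WAIC reduces to minus twice the log-likelihood, which is a convenient sanity check.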


3 A Gut Microbiome Application


In this section, we fit the DM and the FDM regression models to a microbiome dataset
analyzed by Xia et al. [19] and originally presented by Wu et al. [18], who collected
gut microbiome data on 98 healthy volunteers. In particular, the counts of three
bacterial genera were recorded, namely Bacteroides, Prevotella, and Ruminococcus.
Arumugam et al. [2] used these three bacteria to define three groups they called
enterotypes. These enterotypes provide information about the human body's ability
to produce vitamins.

Wu et al. analyzed the same dataset conducting a cluster analysis via the ‘par- 
titioning around medoids’ (PAM) approach. They detected only two of the three 
enterotypes defined in the work by Arumugam et al. Moreover, these two clusters 
are characterized by different frequencies: 86 out of the 98 samples were allocated 
to the first enterotype, whereas only 12 samples were clustered into enterotype 2. 
This is due to the small number of subjects with a high abundance of Prevotella (i.e., 
only 36 samples showed a Prevotella count greater than 0). 

Besides the bacterial data, we also consider $K = 9$ covariates, representing
information on micro-nutrients in the habitual long-term diet collected using a food
frequency questionnaire. These 9 additional variables have been selected by Xia et
al. using an $\ell_1$-penalized regression approach.

Table 1 shows the posterior mean and 95% credible set (CS) of each parameter
involved in the DMReg and the FDMReg models. Though the significant covariates
are the same across the models, the FDMReg shows a lower WAIC (1662.3 versus 1686.2),
thus being the best model in terms of fit. This is due to the additional set of
parameters involved in the mixture structure, which help to describe this dataset.

The mixture structure of the FDMReg model can be exploited to cluster observations
into groups through a model-based approach. More specifically, each
observation can be allocated to the mixture component that most likely generated it.
Note that the mixing weight estimates (0.637, 0.357 and 0.006, from Table 1)
confirm the presence of two out of the three enterotypes defined by Arumugam
et al. [2]. To further illustrate the benefits of the FDMReg model in a microbiome
data analysis, we compare the clustering profile obtained by the FDMReg model and
the one obtained with the PAM approach used by Wu et al. In particular, Table 2
summarizes this comparison in a confusion matrix. Although the clustering generated
by the FDMReg is based on some distributional assumptions (i.e., the response
is FDM distributed), it agrees with the one obtained by the PAM algorithm for
84% of the observations. This percentage is obtained using the covariates selected
by Xia et al. in a logistic normal multinomial regression model context. Clearly,
the results could be improved by developing an ad hoc variable selection procedure
for the FDMReg model. The main advantage of considering the FDMReg (that is, a
model-based clustering approach) is that, besides the clustering of the data points,
it also provides some information on the detected clusters (e.g., their size and a
measure of their distance) and on the relationship between the response and the set of
covariates. This additional information may increase the insight we can gain from


Table 1 Posterior mean and 95% CS for the parameters of the DMReg and FDMReg models.
Regression coefficients in bold are related to 95% CS's not containing the zero value.


                                    DM                              FDM
                                Post. Mean  95% CS              Post. Mean  95% CS

Bacteroides
  Intercept                          2.197  (1.844, 2.546)           2.642  (2.215, 3.034)
  Proline                           -0.039  (-0.344, 0.273)         -0.036  (-0.325, 0.261)
  Sucrose                           -0.257  (-0.555, 0.039)         -0.208  (-0.471, 0.064)
  Vitamin E, food fortification     -0.016  (-0.351, 0.336)         -0.043  (-0.351, 0.299)
  Beta cryptoxanthin                -0.073  (-0.357, 0.237)         -0.059  (-0.334, 0.214)
  Added germ from wheats            -0.147  (-0.477, 0.196)         -0.042  (-0.411, 0.271)
  Vitamin C                          0.300  (-0.031, 0.771)          0.267  (-0.035, 0.673)
  Maltose                           -0.031  (-0.311, 0.260)          0.034  (-0.237, 0.302)
  Palmitelaidic trans fatty acid     0.019  (-0.292, 0.328)         -0.044  (-0.336, 0.251)
  Acrylamide                         0.133  (-0.167, 0.455)          0.184  (-0.094, 0.474)

Prevotella
  Intercept                         -1.196  (-1.715, -0.699)        -0.402  (-1.094, 0.245)
  Proline                           -0.053  (-0.571, 0.443)         -0.018  (-0.663, 0.546)
  Sucrose                            0.029  (-0.437, 0.476)          0.126  (-0.335, 0.591)
  Vitamin E, food fortification      0.109  (-0.355, 0.548)          0.113  (-0.473, 0.574)
  Beta cryptoxanthin                 0.263  (-0.230, 0.762)          0.349  (-0.386, 0.812)
  Added germ from wheats             0.280  (-0.137, 0.701)          0.121  (-0.298, 0.604)
  Vitamin C                         -0.169  (-1.196, 0.623)         -0.021  (-1.131, 0.738)
  Maltose                            0.640  (0.164, 1.126)           0.877  (0.260, 1.400)
  Palmitelaidic trans fatty acid    -0.530  (-1.008, -0.043)        -0.716  (-1.209, -0.140)
  Acrylamide                         0.780  (0.362, 1.206)           0.800  (0.382, 1.231)

$\alpha^+$                           1.541  (1.104, 2.040)           2.275  (1.489, 3.208)
$p_1$                                    -  -                        0.637  (0.420, 0.797)
$p_2$                                    -  -                        0.357  (0.197, 0.570)
$p_3$                                    -  -                        0.006  (0.000, 0.027)
$\tilde{w}$                              -  -                        0.914  (0.791, 0.991)
WAIC                                1686.2                          1662.3

data. Further improvements could be obtained by considering an even more flexible
distribution for $\Pi$, namely the extended flexible Dirichlet [11].
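
The model-based allocation described in this section can be sketched as follows: each observation is assigned to the component with the largest posterior probability, proportional to the mixing weight times the DM likelihood of that component. This is a minimal illustration with a fixed mean vector (in the regression model $\mu_i$ would depend on the covariates), and all parameter values and function names are assumptions rather than the estimates of Table 1.

```python
from math import exp, lgamma, log


def dm_logpmf(y, mean, precision):
    """Log p.m.f. of a Dirichlet-multinomial with scaled mean vector `mean`."""
    n = sum(y)
    out = lgamma(n + 1) + lgamma(precision) - lgamma(precision + n)
    for y_r, m_r in zip(y, mean):
        out += lgamma(precision * m_r + y_r) - lgamma(precision * m_r) - lgamma(y_r + 1)
    return out


def allocate(y, mu, alpha_plus, p, w):
    """Return (index of the most probable component, posterior probabilities)."""
    d = len(mu)
    phi = alpha_plus / (1.0 - w)  # common precision of the DM components
    scores = []
    for j in range(d):
        lam_j = [mu[r] - w * p[r] + (w if r == j else 0.0) for r in range(d)]
        scores.append(log(p[j]) + dm_logpmf(y, lam_j, phi))
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [exp(s - top) for s in scores]
    total = sum(weights)
    posterior = [v / total for v in weights]
    return posterior.index(max(posterior)), posterior


# Illustrative values only: a sample dominated by the second taxon.
mu, p = (0.5, 0.3, 0.2), (0.6, 0.35, 0.05)
cluster, posterior = allocate((2, 45, 3), mu, alpha_plus=5.0, p=p, w=0.5)
```

Because the $j$-th component shifts mass towards the $j$-th taxon, a sample dominated by one taxon tends to be allocated to the corresponding component, which is what links the components to enterotypes.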


Table 2 Confusion matrix for clustering based on the FDMReg model compared to the PAM 
algorithm.


Stability of Mixed-type Cluster Partitions for
Determination of the Number of Clusters


Rabea Aschenbruck, Gero Szepannek, and Adalbert F. X. Wilhelm 


Abstract For partitioning clustering methods, the number of clusters has to be
determined in advance. One approach to dealing with this issue is the use of stability
indices. In this paper, several stability-based validation methods are investigated with
regard to the k-prototypes algorithm for mixed-type data. The stability-based approaches
are compared to common validation indices in a comprehensive simulation study in
order to analyze preferability as a function of the underlying data generating process.


Keywords: cluster stability, cluster validation, mixed-type data 


1 Introduction 


In cluster analysis practice, it is common to work with mixed-type data (i.e.
numerical and categorical variables), whereas theoretical development has
traditionally often been restricted to numerical data. A comprehensive overview of
cluster analysis based on mixed-type data is given in [1]. To cluster mixed-type data, a
popular approach is the k-prototypes algorithm, an extension of the well-known k-means
algorithm, as proposed in [2] and implemented in [3].

As for all partitioning clustering methods, the number of clusters has to be
specified in advance. In the past, several validation methods have been identified for
the k-prototypes algorithm to enable the rating of clusters and to determine the
index-optimal number of clusters. A brief overview is given in Section 2, followed by an
examination of the investigated stability indices to improve clustering of mixed-type
data¹. Section 3 presents a simulation study conducted to compare the
performance of stability indices as well as a newly proposed adjustment, and
additionally to rate the performance with respect to internal validation indices.
Finally, a summary, which does not find a general superiority of the stability-based
approaches over internal validation indices, and an outlook are given in Section 4.


2 Stability of Cluster Partitions 


The assessment of cluster quality can be used for the comparison of clusters resulting
from different methods or from the same method but with different input parameters,
e.g., with a different number of clusters. The latter in particular was already an
important issue in partitioning clustering many decades ago [5]. Since then, some
work has been done on this subject. Hennig [6] points out that nowadays some
literature uses the term cluster validation exclusively for methods that decide about the
optimal number of clusters, in the following named internal validation. An overview 
of internal validation indices is given, e.g., in [7] or [8]. In [9], a set of internal cluster 
validation indices for mixed-type data to determine the number of clusters for the 
k-prototypes algorithm was derived and analyzed. In the following, stability indices 
are presented, before they are compared to each other and additionally to internal 
validation indices in Section 3. Since cluster stability is a model agnostic method, 
the indices are applicable to any clustering algorithm and not limited to numerical 
data [10]. 

A partition $S$ splits data $Y = \{y_1, \ldots, y_n\}$ into $K$ groups
$S_1, \ldots, S_K \subseteq Y$. The focus of this paper is on the evaluation and rating
of cluster partitions with so-called stability indices. To calculate these, as discussed
by Dolnicar and Leisch [11] or mentioned by Fang and Wang [12], $B$ bootstrap samples
$Y^b$, $b \in \{1, \ldots, B\}$ (with replacement, see e.g. [13]), are drawn from the
original data set $Y$. For every bootstrap sample $Y^b$, a cluster partition
$S^b = \{S_1^b, \ldots, S_{L_b}^b\}$ is determined. For the validation of the different
results of these bootstrap samples, the set of points from the original data set that
are also part of the $b$-th bootstrap sample, $X^b = Y \cap Y^b$, is used, where $n_b$
is the size of $X^b$. Furthermore, $C^b = \{S_k \cap X^b \mid k = 1, \ldots, K\}$ and
$D^b = \{S_l^b \cap X^b \mid l = 1, \ldots, L_b\}$, with $B_{C_k}$ being the number of
bootstrap samples for which $C_k^b \neq \emptyset$, and $n_{S_k}$, $n_{C_k^b}$,
$n_{S_l^b}$, and $n_{D_l^b}$ with $k \in \{1, \ldots, K\}$, $l \in \{1, \ldots, L_b\}$
are the numbers of objects in cluster groups $S_k$, $C_k^b$, $S_l^b$ and $D_l^b$,
respectively.
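
The bookkeeping in this paragraph can be sketched in a few lines. The snippet below is a minimal illustration, not the clustMixType implementation: it draws one bootstrap sample by index, forms the set of distinct drawn objects, and restricts a given partition of the original data to that set. The function name and the toy partition are assumptions; clustering the bootstrap sample itself, which would yield the second partition, is omitted.

```python
import random


def bootstrap_restriction(partition, rng):
    """Draw one bootstrap sample and restrict `partition` to the drawn objects.

    `partition[i]` is the cluster label of object i in the original data.
    Returns the drawn indices, the distinct drawn indices, and the restricted groups.
    """
    n = len(partition)
    drawn = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
    x_b = sorted(set(drawn))                      # original objects present in the sample
    c_b = {}
    for i in x_b:
        c_b.setdefault(partition[i], []).append(i)  # restricted cluster groups
    return drawn, x_b, c_b


# Toy partition of 15 objects into three groups (hypothetical data).
partition = [0] * 5 + [1] * 5 + [2] * 5
drawn, x_b, c_b = bootstrap_restriction(partition, random.Random(1))
n_b = len(x_b)
```

On average roughly 63% of the original objects appear in a bootstrap sample, so a small cluster can be missing from the restricted partition entirely, which is why the number of samples with a non-empty restricted group is tracked.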

In 2002, Ben-Hur et al. [14] presented stability-based methods, which can be used
to define the optimal number of clusters. In their work, the basis for the calculation
of the stability indices is a binary matrix $P^{C^b}$, which represents the cluster
partition $C^b$ in the following way

$$P^{C^b}_{ij} = \begin{cases} 1, & \text{if objects } x_i, x_j \in X^b \text{ are in the same cluster and } i \neq j, \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$


1 The mentioned and analyzed stability indices will extend the R package clustMixType [4].


With $P^{D^b}$ defined analogously, the dot product of the two cluster partitions $C^b$
and $D^b$ is defined as
$D(P^{C^b}, P^{D^b}) = \sum_{i,j} P^{C^b}_{ij} P^{D^b}_{ij}$. This leads to a Jaccard
coefficient based index of two cluster partitions $C^b$ and $D^b$

$$\text{Stab}_J(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{D(P^{C^b}, P^{C^b}) + D(P^{D^b}, P^{D^b}) - D(P^{C^b}, P^{D^b})}. \qquad (2)$$


Hennig proposed a so-called local stability measure for every cluster group in a
cluster partition based on the Jaccard coefficient as well [15]. To obtain one stability
value $\text{Stab}_{J_{cw}}$ for the whole partition, the weighted mean of the
cluster-wise values with respect to the size of the cluster groups is determined.
Another stability-based index presented by Ben-Hur et al., based on the simple matching
coefficient, is called Rand index [16] and defined as

$$\text{Stab}_R(P^{C^b}, P^{D^b}) = 1 - \frac{1}{n_b^2} \left\| P^{C^b} - P^{D^b} \right\|^2. \qquad (3)$$


Additionally, they present the stability index based on a similarity measure, which
was originally mentioned by Fowlkes and Mallows [17],

$$\text{Stab}_{FM}(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{\sqrt{D(P^{C^b}, P^{C^b})\, D(P^{D^b}, P^{D^b})}}. \qquad (4)$$

For determination of the number of clusters, Ben-Hur et al. proposed the analysis of
the distribution of index values calculated between pairs of clustered sub-samples,
where high pairwise similarities indicate a stable partition. The authors suggest
examining the transition from a stable to an unstable clustering state. In
the simulation study, this qualitative criterion was numerically approximated by
the differences in the areas under the corresponding distribution curves. Furthermore,
von Luxburg [18] published an approach to obtain the cluster partition stability based
on the minimal matching distance, where the minimum is taken over all permutations of
the $K$ labels of clusters. The distances are then summarized by their mean to obtain
$\text{Instab}_L(P^{C^b}, P^{D^b})$ and, correspondingly,
$\text{Stab}_L(P^{C^b}, P^{D^b}) = 1 - \text{Instab}_L(P^{C^b}, P^{D^b})$.
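
The Jaccard, Rand and Fowlkes-Mallows type indices described above all derive from the same three dot products of the binary co-membership matrices. The following sketch computes them for two label vectors defined on the same objects. It assumes that the Rand-type index is normalized by the squared number of objects; the function names and the toy labels are hypothetical, and the code is not taken from clustMixType.

```python
from math import sqrt


def connectivity(labels):
    """Binary matrix with entry 1 if objects i != j share a cluster, else 0."""
    n = len(labels)
    return [[1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def dot(p, q):
    """Sum of elementwise products of two matrices of equal shape."""
    return sum(a * b for row_p, row_q in zip(p, q) for a, b in zip(row_p, row_q))


def stability_indices(labels_c, labels_d):
    """Return (Jaccard, Rand, Fowlkes-Mallows) for two partitions of the same objects.

    Assumes each partition has at least one cluster with two or more objects.
    """
    pc, pd = connectivity(labels_c), connectivity(labels_d)
    cc, dd, cd = dot(pc, pc), dot(pd, pd), dot(pc, pd)
    n_b = len(labels_c)
    jaccard = cd / (cc + dd - cd)
    # For binary matrices, ||P_C - P_D||^2 = cc + dd - 2 * cd.
    rand = 1.0 - (cc + dd - 2 * cd) / n_b ** 2
    fowlkes_mallows = cd / sqrt(cc * dd)
    return jaccard, rand, fowlkes_mallows


# Two partitions of four objects that agree only on the pair (0, 1).
jaccard, rand, fowlkes_mallows = stability_indices([0, 0, 1, 1], [0, 0, 0, 1])
```

Since the indices depend only on co-membership, they are invariant to the labelling of the clusters and can compare partitions with different numbers of groups.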


3 Simulation Study 


In order to compare the stability indices with each other and, subsequently, with
the internal validation indices, a simulation study was conducted. In the
following, the setup and execution of this simulation study, starting with the data
generation, are briefly presented, and subsequently the results are evaluated.


3.1 Data Generation and Execution of Simulation Study


The simulation study is based on artificial data, which are generated for different
scenarios. In Table 1, the features that define the data scenarios and their
corresponding parameter values are listed. Since a full factorial design is used, there
are 120 different data settings in the conducted simulation study.² The selection of the
considered features follows the characteristics of the simulation study in [19] and
was extended with respect to the ratio of the variable types as in [20].


Table 1 Features and the associated feature specifications used to generate the data scenarios.


data parameter                                         feature specification   short
number of clusters                                     2, 4, 8                 nc
clusters of equal size (FALSE: randomly drawn sizes)   TRUE, FALSE             symm
number of variables                                    2, 4, 8                 nv
ratio of factor to numerical variables                 0.25, 0.5, 0.75         fac_prop
overlap between cluster groups                         0, 0.05, 0.1            overlap
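
The count of 120 data settings can be reproduced by enumerating the factorial design and dropping the combinations that the paper's footnote on this design rules out (no scenario with two variables and eight clusters; with two variables only the ratio 0.5). The sketch below is an illustration with hypothetical variable names, not the authors' simulation code.

```python
from itertools import product

# Feature levels as listed in the table of data parameters.
levels = {
    "nc": (2, 4, 8),
    "symm": (True, False),
    "nv": (2, 4, 8),
    "fac_prop": (0.25, 0.5, 0.75),
    "overlap": (0, 0.05, 0.1),
}

scenarios = [
    dict(zip(levels, combo))
    for combo in product(*levels.values())
    # Exclusions stated in the footnote on the factorial design.
    if not (combo[2] == 2 and combo[0] == 8)
    if not (combo[2] == 2 and combo[3] != 0.5)
]

per_nc = {nc: sum(1 for s in scenarios if s["nc"] == nc) for nc in levels["nc"]}
```

With N = 10 repetitions per scenario this corresponds to 420, 420 and 360 runs for two, four and eight generated clusters, respectively.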


The clusters of the 200 observations are defined by the feature settings. Each
variable can either be active or inactive. For the numerical variables, active means
drawing values from the normal distribution $X_1 \sim N(\mu_1, 1)$, with random
$\mu_1 \in \{0, \ldots, 20\}$, and inactive means drawing from $X_0 \sim N(\mu_0, 1)$
with $\mu_0 = 2 q_{1 - v/2} - \mu_1$, where $q_\alpha$ is the $\alpha$-quantile of
$N(\mu_1, 1)$ and $v \in \{0.05, 0.1\}$. This results in an overlap
of $v$ for the two normal distributions. To achieve an overlap of $v = 0$, the inactive
variable is drawn from $N(\mu_1 - 10, 1)$. Furthermore, each factor variable has two
levels, $l_0$ and $l_1$. The probability for drawing $l_0$ for an active variable is $v$
and $(1 - v)$ for level $l_1$. For an inactive variable, the probability for $l_0$ is
$(1 - v)$ and $v$ for $l_1$.
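
The relation between the shift of the inactive mean and the overlap of the two normal densities can be checked numerically. The sketch below assumes that the inactive mean is placed at twice the $(1 - v/2)$-quantile of the active distribution minus its mean, which is one way to obtain an overlap of exactly $v$ (a shift of the same size in the opposite direction gives the same overlap). The function name and the value of the active mean are illustrative.

```python
from statistics import NormalDist


def inactive_mean(mu_1, v):
    """Mean of the inactive distribution for a target overlap v with N(mu_1, 1)."""
    if v == 0:
        return mu_1 - 10  # far enough apart for a negligible overlap
    q = NormalDist(mu_1, 1).inv_cdf(1 - v / 2)
    return 2 * q - mu_1


mu_1 = 12  # illustrative active mean
active = NormalDist(mu_1, 1)
overlaps = {
    v: active.overlap(NormalDist(inactive_mean(mu_1, v), 1))
    for v in (0, 0.05, 0.1)
}
```

The overlap here is the area shared by the two density curves, so $v = 0.1$ means that ten percent of the probability mass is common to the active and inactive distributions.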

Below, the code structure of the simulation study is presented. For each of the 120
data scenarios, N = 10 repetitions were performed. This should mitigate the
influence of the random initialization of the k-prototypes algorithm. For the range of
two up to nine cluster groups, the stability indices are determined based on bootstrap
samples as suggested in [21]. In order to rank the performance of the stability-based
indices, the internal validation indices were also determined on the same data.


Pseudo-Code Simulation Study


for(every data situation){
  for(i in 1:N){ # 10 iterations to mitigate/soften random influences
    data <- create.data(data situation)
    for(q in 2:9){
      output <- kproto(data, k = q, nstart = 20)
      # stability-based indices determined with the usage of 100 bootstrap samples
      stab_val_method <- stab_kproto(output, B = 100, method)
      int_val_method <- validation_kproto(output, method) # internal validation
    }
    # determine optimal cluster size for every method
    cs_method <- max/min(int_val_method or stab_val_method)
  }
}


2 There is no data scenario with two variables and eight cluster groups. Additionally, if there are
two variables, obviously only the 0.5 ratio between factor and numerical variables is possible.


Fig. 1 The evaluations of the four stability-based cluster indices are presented. There are ten
repetitions of rating the data situation for k clusters in the range of two to nine and the
index-optimal number of clusters is highlighted. The parameters of the underlying data structure
are nV = 8, fac_prop = 0.5, overlap = 0.1 and symm = FALSE. The number of clusters nC in the
data structure varies row-wise.


3.2 Analysis of the Results 


Figure 1 shows exemplary results of the simulation study for three different data 
scenarios over the 10 repetitions. Each row of the figure shows a different data 
scenario and each column shows one of the four stability-based indices. The first row 
is related to a data scenario with two clusters (marked by a vertical green line). Each 
plot shows the examined number of clusters and the determined index value for the 10 
repetitions. The maximum index value for each repetition is highlighted with a larger 
dot and marks the index-optimal number of clusters of this repetition. It can be seen 
that all of the four different indices detected the two clusters in the underlying data 
structure. Rows two and three show the evaluations of data with cluster partitions 
of four and eight clusters, respectively. It can be seen that the generated number of 
clusters is not always rated as index optimal (for example, with four clusters, two or 
three clusters were often also evaluated as optimal). Since the results shown here are
representative of all scenarios, the four cluster indices and their interpretation were
examined in more detail.

In the left part of Figure 2, different transformations of the index values are
presented. Besides the standard index values (green line), the numerical approximation
of the approach of Ben-Hur et al. mentioned above is also shown (red line). For
the Jaccard-based evaluation, the proposed cluster-wise stability determination by
Hennig is presented in orange. Additionally, we propose an adjustment of the index
values (hereinafter referred to as new adjust), similar to [22], to take into account not
only the magnitude of the index but also the local slope: the index value scaled with
the geometric mean of the changes to the neighbor values is presented in dark green.
Again, for each variation of the indices, the index-optimal value is highlighted. The
numerically determined index values according to the approach of Ben-Hur et al.
yield no benefit; thus it can be concluded that this quantification is not appropriate
for the purpose and that further research is required. The cluster-wise stability
determination of the Jaccard index also does not seem to improve the determination of
the number of clusters to a large extent. In the example in Figure 2, the new adjustment
strengthens the local slope for four evaluated cluster groups, which leads to a
determination of four cluster groups (the generated number of clusters). Since only one
iteration of one data scenario is shown on the left, the number of correctly determined
numbers of clusters with respect to the generated number of clusters is shown on the
right-hand side of Figure 2. These sums for two, four and eight clusters in the
underlying data structure point out the improvement achieved by the proposed adjustment
of the index values. Especially for more than two clusters, the rate of correctly
determined numbers of clusters can be increased.

Fig. 2 [...] the parameters nC = 4, nV = 8, fac_prop = 0.5, overlap = 0.1 and symm = FALSE. Right: [...]

Finally, the internal validation indices were examined for comparison. To analyze
the outcome of the simulation study, the determined index-optimal numbers of
clusters are shown in Table 2. While the comparison for two clusters in the underlying
data shows a slight advantage for the stability-based indices, for eight clusters in
particular the internal validation indices are preferable. To gain a better
understanding of the mean success rate of determining the correct number of clusters
for each data scenario, Figure 3 further shows the results of a linear regression on
the various data parameters. It can be seen that in most cases there is little
difference between the considered methods. The stability-based indices do a better
job of determining the number of clusters for data with equally large cluster groups.
As expected, a larger number of variables leads to a better determination of the number
of clusters. The largest variation in the influence on the proportion of correct
determinations can be seen for the parameter number of clusters. The more cluster groups
are present in the underlying data structure, the worse the determination becomes
(especially for the stability-based indices and the indices Ptbiserial and Tau).


Table 2 Determined number of clusters for all data scenarios with nC ∈ {2, 4, 8}, summarized by
the stability-based as well as internal validation indices and the evaluated number of clusters.
(? = entry not legible.)


             nC = 2                          | nC = 4                          | nC = 8
clusters       2   3   4   5   6   7   8   9 |   2   3   4   5   6   7   8   9 |   2   3   4   5   6   7   8   9
J_newadj     403  17   0   0   0   0   0   0 |  47  74 298   1   0   0   0   0 |  90  70  27  16  16  26 104  11
R_newadj     391  18   5   0   1   1   1   3 |  56  99 258   3   2   0   0   2 |  38  68  22  17  16  32 133  34
FM_newadj    402  17   1   0   0   0   0   0 |  50  80 289   1   0   0   0   0 |  88  71  26  16  15  26 106  12
L_newadj     394  21   5   0   0   0   0   0 |  53  83 282   2   0   0   0   0 | 100  97  31  20  16  16  76   4
CIndex       313  13   2   2   1   4  18  67 |   7  27 344  13   3   2   5  19 |   2   0   2   4  22  28 211  91
Dunn         386  24   4   2   0   1   1   2 |  39  56 307   8   7   3   0   0 |  19   9  17   7  37  53 190  28
Gamma        343   9   1   0   1   2  14  50 |   9  16 356  15   3   1   5  15 |   2   1   4   4  16  16 198 119
GPlus        319   8   1   0   0   0   9  83 |   ?   ? 319  12   ?   2   ?   ? |   2   1   1   4  14  12 175 151
McClain       71   3   1   1   5  12  57 270 |   0   0  17   4   4  13  87 295 |   0   0   0   0   0   9  34 317
Ptbiserial   400  11   6   0   3   0   0   0 |  72 120 225   3   0   0   0   0 |  31  62  79  65  55  39  26   3
Silhouette   388   3   1   4   4   5   8   7 |  14  37 348   7   0   0   8   6 |   6   0   3   1  12  46 220  72
Tau          391  16   9   0   4   0   0   0 |  68 144 205   3   0   0   0   0 |  33  82 119  68  40  14   3   1


Fig. 3 Linear regression coefficients for the parameters of the five data set features
(fac_prop, nc, nv, overlap, symm), where coefficients whose confidence intervals contain 0
are displayed transparently.


4 Conclusion 


The aim of this study was to investigate the determination of the optimal number of 
clusters based on stability indices. Several variations of analysis methods of stability- 
based index values were presented and comparatively analyzed in a simulation study. 
The proposed adjustment of the index values with respect not only to their magnitude 
but also to the local slope was able to improve the standard stability indices, especially 
for a smaller number of clusters. The simulation study did not show any general 
superiority of stability-based approaches over internal validation indices. 

In the future, the various methods of analyzing the stability-based index values 
should be examined in more detail, e.g., taking into account the Adjusted Rand 
Index. For this purpose, further research may address the characteristics of the 
evaluated curves more precisely, or further extend the approach of Ben-Hur et al. as 
a quantitative determination method, which has not been done yet. 
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A Review on Official Survey Item Classification 
for Mixed-Mode Effects Adjustment 


Afshin Ashofteh and Pedro Campos 


Abstract The COVID-19 pandemic has had a direct impact on the development, production,
and dissemination of official statistics. This situation led National Statistics
Institutes (NSIs) to make methodological and practical choices for survey collection
that do not require direct contact with interviewing staff (i.e. remote survey data
collection). Mixing computer-assisted telephone interviewing (CATI) and computer-assisted
web interviewing (CAWI) with direct-contact interviewing constitutes a new way of
collecting data during the COVID-19 crisis. This paper presents a literature review
summarizing the role of statistical classification and design weights in controlling
coverage errors and non-response bias in mixed-mode questionnaire design. We identified
289 research articles with a computerized search over two databases, Scopus and
Web of Science. We found that, although mixed-mode surveys can be considered a
substitute for traditional face-to-face interviews (CAPI), proper statistical
classification of survey items and respondents is important to control nonresponse
rates and the risk of coverage error.


Keywords: mixed-mode official surveys, item classification, weighting methods, 
clustering, measurement error 
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1 Introduction 


This paper provides a summary of a systematic literature review of the role of 
classification variables and weighting methods of mixed-mode surveys in minimizing 
the measurement error, coverage error, and nonresponse bias. 

Before the COVID-19 pandemic, the statistical adjustment of mode-specific measurement
effects had been studied by many scholars. After the onset of the pandemic, however,
survey methodologists made a strong effort to meet the challenge of collecting data of
proper quality under the new restrictions [1]. NSIs considered mixing different data
collection modes, taking into account each mode's contribution to the overall published
statistics, as a solution. Methodologists have been trying to use technology, data
science, and mixed-device surveys, rather than the traditional interviewer-assisted and
paper survey modes, to decrease the expected coverage error and nonresponse bias with
the new target populations during the pandemic [2]. This coverage error is caused by
the change of the target population from the general population to the part of the
general population that is accessible with technological devices. Te Braak et al. [3]
highlighted how the representativeness of self-administered online surveys is expected
to be affected by decreased response rates. Their research demonstrates that a large
group of respondents drop out selectively and that this selectivity varies with the
moment of dropout and with demographic categorical information.

According to studies at Statistics Portugal, using classification methods based on
categorical variables and applying repeated weighting techniques seems to be fruitful
for estimating and adjusting for mode and device effects. Many authors have discussed
the use of weights in statistical analysis [4]. It is important to improve inference in
cases where mixed-mode effects are combined with measurement errors caused by primary
data collection on categorical variables and socio-demographic information. On the one
hand, when the categorical variables are collected with the help of respondents
(primary data), the survey mode has a strong impact on answering behavior and
answering conditions. Respondents might regard some of the new categorical variables
as sensitive or privacy-intrusive information. They may not be willing to share these
personal data, which are necessary for statistical classification, by telephone or
through technological devices. Additionally, for NSIs the new data collection channels
are costly, and redesigning the survey estimation methodology is time-consuming. On
the other hand, when the categorical variables have to be available in sampling frames
(secondary data), coverage error is the main concern. For instance, in the CATI
surveys of Statistics Portugal after COVID-19, the population was considered as
belonging to the following categories: (i) households with a listed landline
telephone, (ii) households that do not have a landline telephone but use only a mobile
telephone, and (iii) households that do not have a telephone at all (or whose number
is unknown). We could expect these households to have very different socioeconomic
characteristics, and new methods of classification or clustering to be helpful for
measurement error adjustment during the pandemic. However, if the households differ in
the important categorical variables of the survey, then a weighting solution could
amplify a part of the sample that does not represent the population. As a result,
statistical classification would become another source of bias instead of solving the
problem. Therefore, we could expect two approaches. First, we could ignore
classification, because we consider the groups to be homogeneous, and recommend
weighting to adjust for the COVID-19 pandemic situation and non-observation errors.
Second, if the groups of respondents are different, we need categorical variables. In
this case, the non-observation errors of CATI and CAWI cannot be covered by changing
the weights alone, and we have to recommend CAPI to collect categorical information
and apply clustering and weighting together to achieve reasonable coverage with mixed
modes.

This study undertakes a systematic literature review on this topic, guided by the
following question: what is the best methodology or modified estimation strategy to
mitigate mode-effect problems based on design weighting and classification? To answer
this question, we performed a systematic review limited to the following sources: Web
of Science, Scopus, and working papers from NSIs. We only considered papers written in
English. This article is organized as follows. Section 2 presents the research
methodology, which covers keyword identification, search, databases, and bibliometric
analysis. In Section 3, we present the results, including the PRISMA flow diagram,
characteristics of the articles, author co-authorship analysis, and keyword occurrence
over the years. In Section 4, we discuss the content analysis. Section 5 presents the
main conclusions and, finally, Section 6 outlines the main research gaps and future
work.


2 Methods 


To accomplish the research, the Preferred Reporting Items for Systematic Reviews and
Meta-Analyses (PRISMA) methodology was adopted. The selection of papers from the
databases (Scopus and WOS) was based on a screening that started with the search keywords
((mixed-mode* OR "Mode effect*") AND (weighting OR weight* OR classification) AND
("Measurement error*" OR "Non-response bias" OR "Data quality" OR "response rate*") AND
(capi OR "Computer Assisted Personal Interview*" OR cawi OR "Assisted Web Interview*" OR
cati OR "Computer Assisted Telephone Interview*" OR "web survey*" OR "mail survey*" OR
"telephone survey*")), after which the result was filtered by "official statistics". The
results from the two databases were merged, and duplicates were then removed. For
bibliometric analysis, the Mendeley reference manager was used to extract metadata and
eliminate duplicates. For network analysis, the VOSviewer tool was applied to visualize
the information extracted from the data set and to obtain the quantitative and
qualitative outcomes. After assessing eligibility, books and review papers were omitted
from the results and relevant articles were selected from the databases. The final
dataset was selected according to the flow diagram in Figure 1, which shows detailed
information about this systematic literature review.
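
The merge-and-deduplicate step described above can be sketched as follows. This is a
minimal illustration and not the Mendeley procedure itself: the record fields (title,
year, doi), the matching rule (DOI first, otherwise normalized title plus year), and the
sample records are all assumptions made for the example.

```python
import re

def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace for matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def record_key(record):
    """Prefer the DOI as identity; fall back to normalized title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalize_title(record["title"]), record.get("year"))

def merge_and_deduplicate(*sources):
    """Merge record lists, keeping the first occurrence of each key."""
    seen, merged = set(), []
    for records in sources:
        for record in records:
            key = record_key(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Hypothetical export records (titles and DOIs are invented for illustration).
scopus = [
    {"title": "Mixed-mode surveys: A review", "year": 2019, "doi": "10.1000/x1"},
    {"title": "Web survey weighting", "year": 2020, "doi": None},
]
wos = [
    {"title": "Mixed-Mode Surveys: a Review", "year": 2019, "doi": "10.1000/X1"},
    {"title": "Web  survey weighting.", "year": 2020, "doi": ""},
    {"title": "Telephone survey coverage", "year": 2021, "doi": "10.1000/x3"},
]
merged = merge_and_deduplicate(scopus, wos)
print(len(merged))  # 3 unique records remain
```

A record that carries a DOI in one database but not in the other would not be matched by
this rule, so a manual check of near-duplicates is still advisable.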
[Figure 1: PRISMA flow diagram. Legible counts: publications identified from Scopus
(n = 178) and Web of Science (n = 34); duplicate records removed before screening (n = 8);
publications screened (n = 204); publications excluded for irrelevant subject area (n = 6);
publications identified from backward citation search (n = 231); studies included in
review (n = 289).]
Fig. 1 Literature review flow diagram. (Source: Author’s preparation). 
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Fig. 2 Density visualization analysis of the 22 leader authors who have at least 3 papers. 


3 Results 


The 28 leader authors who had at least 4 papers are presented in Figure 2. Author
occurrence analysis was performed with the VOSviewer tool for network analysis. The top
three leader authors were Mick P. Couper with 14 articles, Barry Schouten with 14
articles, and Roger Tourangeau with 11 articles. Keyword analysis was also carried out
with VOSviewer. We analyzed the co-occurrence of author keywords with the full counting
method. In the first step, we set the minimum number of occurrences of a keyword to one,
which resulted in 711 keywords. The use of keywords over the years is shown in Figure 3.
Some of the keywords were not exactly the same, but their use and meaning were the same.
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Fig. 3 Application of keywords over years. 


We decided to match similar words to make the output clearer. Choosing the full 
counting method resulted in a total of 592 authors meeting the threshold. 


4 Content Analysis 


The studies emphasize the dramatic change in mixed-mode strategies in recent decades,
based on design-based and model-assisted survey sampling, time series methods, and
small area estimation [6]. Further changes are highly expected, especially after the
extensive experience NSIs gained in trying new modes after the COVID-19 pandemic [7].

The problem concerns mixed-mode effects and calibration. Briefly, several approaches
can be followed, such as design weighting to find sampling weights, nonresponse
weighting adjustment, and calibration. The design weight of a unit may be interpreted
as the number of population units represented by that sample unit. Most surveys, if not
all, suffer from item or unit nonresponse. Auxiliary information can be used to improve
the quality of design-weighted estimates. An auxiliary variable must have at least two
characteristics to be considered in calibration: (i) it must be available for all sample
units; and (ii) its population total must be known.
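
The three weighting steps named above can be illustrated with a short sketch. It is a
simplified example under stated assumptions, not the procedure of any particular NSI:
nonresponse is adjusted within weighting classes, calibration is reduced to
post-stratification on a single categorical auxiliary variable, and the function names,
inclusion probabilities, and population totals are hypothetical.

```python
def design_weights(inclusion_probs):
    """Design weight d_i = 1 / pi_i: population units represented by unit i."""
    return [1.0 / p for p in inclusion_probs]

def nonresponse_adjust(d, responded, classes):
    """Within each class, spread the weight of nonrespondents over respondents."""
    adjusted = [0.0] * len(d)
    for c in set(classes):
        idx = [i for i, ci in enumerate(classes) if ci == c]
        total = sum(d[i] for i in idx)
        resp = sum(d[i] for i in idx if responded[i])
        for i in idx:
            if responded[i]:
                adjusted[i] = d[i] * total / resp
    return adjusted

def poststratify(w, groups, population_totals):
    """Calibrate so that weighted group counts equal known population totals."""
    sums = {}
    for wi, g in zip(w, groups):
        sums[g] = sums.get(g, 0.0) + wi
    return [wi * population_totals[g] / sums[g] if wi > 0 else 0.0
            for wi, g in zip(w, groups)]

# Six sampled units, each selected with probability 0.1 (illustrative values).
d = design_weights([0.1] * 6)
responded = [True, True, False, True, False, True]
classes = ["a", "a", "a", "b", "b", "b"]
w_nr = nonresponse_adjust(d, responded, classes)
# Assumed known population totals of the auxiliary variable.
w_cal = poststratify(w_nr, classes, {"a": 40.0, "b": 20.0})
print(w_nr, w_cal)
```

The auxiliary variable used in the last step satisfies both conditions stated above: it is
observed for every sample unit and its population totals are known.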

The categorical variables from the demographic information of nonrespondents, such as
education level, age, income, location, language, and marital status, could help survey
methodologists to categorize the target population and identify the best sequence of
modes [8]. Van Berkel et al. [9] considered nine strata in their classification tree,
using age, ethnicity, urbanization, and income as explanatory variables. The
re-interview design and the inverse regression estimator (IREG) are among the best
approaches to reduce measurement bias by using related auxiliary information [10].

The focus of this approach is on the weights of estimators rather than the bias
from the measurements. For an estimator, we could consider $y_{i,m}$, the measurement
obtained from unit $i$ through mode $m$. The $y_{i,m}$ consists of $u_i$ as the true value
for respondent $i$, an additive mode-dependent measurement bias $b_m$, and a
mode-dependent random measurement error $\varepsilon_{i,m}$ with an expected value equal
to zero. Equation (1) shows the measurement error model.

$$y_{i,m} = u_i + b_m + \varepsilon_{i,m} \tag{1}$$

If we consider two different modes $m$ and $m'$, then the differential measurement
error between these two modes is given by

$$y_{i,m} - y_{i,m'} = (b_m - b_{m'}) + (\varepsilon_{i,m} - \varepsilon_{i,m'}) \tag{2}$$

The expected value of this difference, $b_m - b_{m'}$, is the differential measurement
bias. If we consider $\hat{t}_y$ as an estimate of the total of variable $y$ according to
its observations $y_{i,m}$ in different modes, then

$$\hat{t}_y = \sum_{i=1}^{n} w_i y_{i,m} \tag{3}$$

where $w_i$ is a survey weight assigned to unit $i$ with $n$ the number of respondents.
From a combination of equations (2) and (3), and taking the expectation over the
measurement error model (1), we would have


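$$E(\hat{t}_y) = E\left[\sum_{i=1}^{n} w_i y_{i,m}\right] = \sum_{i=1}^{n} w_i u_i + \sum_{i=1}^{n} b_m w_i \delta_{i,m} + \sum_{i=1}^{n} w_i \delta_{i,m} E(\varepsilon_{i,m}) \tag{4}$$

with $\delta_{i,m} = 1$ if unit $i$ responded through mode $m$, and zero otherwise. Since
$E(\varepsilon_{i,m}) = 0$,

$$E(\hat{t}_y) = E\left[\sum_{i=1}^{n} w_i y_{i,m}\right] = \sum_{i=1}^{n} w_i u_i + \sum_{i=1}^{n} w_i \delta_{i,m} b_m \tag{5}$$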


stating that the expected total of the survey estimate for $Y$ consists of the estimated
true total of $U$, plus the true total of $b_m$ from data collected through mode $m$. Since
$b_m$ is an unobserved mode-dependent measurement bias, the term
$\sum_{i=1}^{n} w_i \delta_{i,m} b_m$ in equation (5) indicates the existence of an unknown
mode-dependent bias in the estimation of $t_y$. According to Equation (5), there is an
unknown measurement bias in sequential mixed-mode designs that might be adjusted by
different estimators. Data obtained via a re-interview design or a subset of respondents
to the first stage of a sequential mixed-mode survey provide the auxiliary information
necessary to adjust measurement bias in sequential mixed-mode surveys. Klausch et al.
[10] propose six different estimators and show that an inverse version of the regression
estimator (IREG) performs well under all considered scenarios. The idea of IREG is to
use re-interview data to estimate the inverse slope of an ordinary or generalized least
squares linear regression linking the benchmark measurements $y^{m_b}$ and the focal-mode
measurements $y^{m_j}$, as follows [11]

$$y_i^{m_j} = \beta_0 + \beta_1 y_i^{m_b} \tag{6}$$

and estimate the measurement of the target variable by applying the inverse of
$\hat{\beta}_1$ in the following estimator, the so-called inverse regression estimator


23 diy?’ +9 di (see = A (sre! p x) b-Lh29bej 0) 


where $\bar{y}_{re}^{m_j}$ and $\bar{y}_{re}^{m_b}$ are the respondent means of the focal
and benchmark mode outcomes in the re-interview and $d_i$ denotes the design weight of
the sample design. For a detailed presentation and discussion of the methods, see
Chapter 8.5 in [12]. However, for longitudinal studies with different modes at different
time points, the effect of time on the respondents would make it difficult to estimate
the pure mixed-mode effect, especially for volatile classification variables such as the
address for immigrants. A solution could be to conduct the survey on parallel or separate
samples to evaluate the time effect and the mode effect separately.
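
The bias decomposition in Equation (5) can be checked numerically with a small
simulation. The sketch below assumes the measurement error model of Equation (1) with
Gaussian errors; the sample size, weights, bias $b_m = 2$, and error standard deviation
are illustrative values and are not taken from any of the reviewed studies.

```python
import random

def expected_total(w, u, delta, b_m):
    """Right-hand side of Eq. (5): weighted true total plus weighted mode bias."""
    true_part = sum(wi * ui for wi, ui in zip(w, u))
    bias_part = sum(wi * di * b_m for wi, di in zip(w, delta))
    return true_part, bias_part

def simulated_total(w, u, delta, b_m, sigma, rng):
    """One draw of the estimator in Eq. (3) under the model in Eq. (1)."""
    total = 0.0
    for wi, ui, di in zip(w, u, delta):
        # Bias and random error enter only for units answering through mode m.
        y = ui + di * (b_m + rng.gauss(0.0, sigma))
        total += wi * y
    return total

rng = random.Random(1)
n = 200
w = [5.0] * n                                        # survey weights
u = [rng.uniform(0.0, 10.0) for _ in range(n)]       # true values
delta = [1 if i % 2 == 0 else 0 for i in range(n)]   # half respond via mode m
true_part, bias_part = expected_total(w, u, delta, b_m=2.0)
draws = [simulated_total(w, u, delta, 2.0, 1.0, rng) for _ in range(2000)]
mc_mean = sum(draws) / len(draws)
print(bias_part, mc_mean - true_part)  # the two values should be close
```

The Monte Carlo mean of the estimated total exceeds the weighted true total by
approximately the weighted sum of $b_m$ over the mode-$m$ respondents, which is the
unknown term that the estimators discussed above try to remove.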

In practice, Statistics Portugal has been using the information available in a
sampling frame that is part of FNA (the national dwellings register database) during
the COVID-19 period. For the samples of CATI rotation-scheme surveys such as the Labour
Force Survey, telephone numbers were linked to a sample drawn from a population
register in FNA. In 2020, the Labour Force Survey (LFS) in Portugal, a mandatory survey
for the member states of the EU, was adjusted for undercoverage related to the
percentage of households with a listed landline telephone. The comparison of these
surveys before and after COVID-19 shows the usefulness of the discussed methodologies.
In 2021, the successful CAWI-mode census by Statistics Portugal showed that respondents
tend to favor the web-based questionnaire to avoid the risk of COVID-19 infection in a
face-to-face interview. This indicates a potential change in respondents' mode
preferences.


5 Conclusions 


The COVID-19 crisis led to new solutions in item classification for mixed-mode effects
adjustment, such as applying mode calibration to population subgroups defined by
categorical variables such as gender, region, and age group. Studies suggest a
sequential mixed-mode design that starts with CAWI as the cheapest mode, supported by an
initial postal mail or telephone contact and a possible cash incentive. After a lag, the
nonrespondents are followed up and given a choice between CAPI and CATI according to
their specific classification group and demographic information, such as education
level, age, income, location, language, and marital status. This design helps to reduce
cost and increase accuracy simultaneously.

This study showed that sampling frames, which are based on choices made several years
ago, might need updates for the necessary categorical information. Additionally, more
research seems necessary on ethics concerns, privacy regulations, and standards for
using categorical variables and classification information in social mixed-mode surveys
and official statistics.
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Clustering and Blockmodeling Temporal 
Networks — Two Indirect Approaches 


Vladimir Batagelj 


Abstract Two approaches to clustering and blockmodeling of temporal networks 
are presented: the first is based on an adaptation of the clustering of symbolic data 
described by modal values and the second is based on clustering with relational 
constraints. Different options for describing a temporal block model are discussed. 


Keywords: social networks, network analysis, blockmodeling, symbolic data anal- 
ysis, clustering with relational constraints 


1 Temporal Networks 


Temporal networks described by temporal quantities (TQs) were introduced in the
paper [2]. We get a temporal network
$\mathcal{N}_T = (\mathcal{V}, \mathcal{L}, \mathcal{T}, \mathcal{P}, \mathcal{W})$ by
attaching the time $\mathcal{T}$ to an ordinary network, where $\mathcal{V}$ is the set of
nodes, $\mathcal{L}$ is the set of links, $\mathcal{P}$ is the set of node properties,
$\mathcal{W}$ is the set of link weights, and $\mathcal{T} = [T_{min}, T_{max})$ is a
linearly ordered set of time points $t \in \mathcal{T}$ which are usually integers or reals.

In a temporal network nodes/links activity/presence, nodes properties, and links
weights can change through time. These changes are described with TQs. A TQ
is described by a sequence $a = [(s_r, f_r, v_r) : r = 1, 2, \ldots, k]$ where $[s_r, f_r)$
determines a time interval and $v_r$ is the value of the TQ $a$ on this interval. The set
$T_a = \bigcup_r [s_r, f_r)$ is called the activity set of $a$. For $t \notin T_a$ its value
is undefined, $a(t) = ⌘$.

Assuming that for every $x \in \mathbb{R} \cup \{⌘\}$: $x + ⌘ = ⌘ + x = x$ and
$x \cdot ⌘ = ⌘ \cdot x = ⌘$, we can extend the addition and multiplication to TQs

$$(a + b)(t) = a(t) + b(t) \quad \text{and} \quad T_{a+b} = T_a \cup T_b$$
$$(a \cdot b)(t) = a(t) \cdot b(t) \quad \text{and} \quad T_{a \cdot b} = T_a \cap T_b$$


Let $T_V(v) \subseteq \mathcal{T}$, $T_V \in \mathcal{P}$, be the activity set for a node
$v \in \mathcal{V}$ and $T_L(\ell) \subseteq \mathcal{T}$, $T_L \in \mathcal{W}$, the
activity set for a link $\ell \in \mathcal{L}$. The following consistency condition
must be fulfilled for activity sets: If a link $\ell(u, v)$ is active at the time point $t$
then its end-nodes $u$ and $v$ should be active at the time point $t$:
$T_L(\ell(u, v)) \subseteq T_V(u) \cap T_V(v)$.

In the following we will need 


1. Total: $\mathrm{total}(a) = \sum_i (f_i - s_i) \cdot v_i$
2. Average: $\mathrm{average}(a) = \dfrac{\mathrm{total}(a)}{|T_a|}$ where $|T_a| = \sum_i (f_i - s_i)$
3. Maximum: $\max(a) = \max_i v_i$


To support the computations with TQs we developed in Python the libraries TQ
and Nets, see https://github.com/bavla/TQ.
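
To make the definitions of this section concrete, the following sketch implements them
on plain Python lists of (s, f, v) triples. It is an independent illustration and does
not reproduce the interface of the TQ or Nets libraries; the function names are
hypothetical, the undefined value ⌘ is represented by None, and the intervals of each TQ
are assumed to be disjoint.

```python
from itertools import chain

def total(a):
    """total(a) = sum of (f - s) * v over the intervals of a."""
    return sum((f - s) * v for s, f, v in a)

def average(a):
    """average(a) = total(a) / |T_a|, with |T_a| the length of the activity set."""
    return total(a) / sum(f - s for s, f, _ in a)

def maximum(a):
    return max(v for _, _, v in a)

def value_at(a, t):
    """Value of a at time t, or None (undefined) outside the activity set."""
    for s, f, v in a:
        if s <= t < f:
            return v
    return None

def combine(a, b, op, union):
    """Apply op on elementary intervals between consecutive breakpoints."""
    points = sorted(set(chain.from_iterable((s, f) for s, f, _ in a + b)))
    out = []
    for s, f in zip(points, points[1:]):
        x, y = value_at(a, s), value_at(b, s)
        if x is None and y is None:
            continue
        if x is None or y is None:
            if not union:
                continue            # product: undefined absorbs
            v = x if y is None else y   # sum: undefined is neutral
        else:
            v = op(x, y)
        if out and out[-1][1] == s and out[-1][2] == v:
            out[-1] = (out[-1][0], f, v)   # merge adjacent equal-valued intervals
        else:
            out.append((s, f, v))
    return out

def tq_sum(a, b):
    return combine(a, b, lambda x, y: x + y, union=True)

def tq_prod(a, b):
    return combine(a, b, lambda x, y: x * y, union=False)

a = [(1, 5, 2), (6, 8, 1)]
b = [(3, 7, 3)]
print(tq_sum(a, b))   # activity set is the union of T_a and T_b
print(tq_prod(a, b))  # activity set is the intersection of T_a and T_b
print(total(a), average(a), maximum(a))
```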


2 Traditional (Generalized) Blockmodeling Scheme 


A blockmodel (BM) [11] consists of structures obtained by identifying all units from
the same cluster of the clustering / partition $\mathcal{C} = \{C_i\}$,
$\pi(v) = i \Leftrightarrow v \in C_i$. Each pair of clusters $(C_i, C_j)$ determines a
block consisting of links linking $C_i$ to $C_j$. For an exact definition of a blockmodel
we have to be precise also about which blocks produce an arc in the reduced graph on
classes and which do not, what is the weight of this arc, and in the case of generalized
BM, of what type. The reduced graph can be represented by a relational matrix, also
called the image matrix.

Fig. 1 Blockmodel.


To develop a BM method we specify a criterion function $P(\mu)$ measuring the
"error" of the BM $\mu$. We can introduce additional knowledge by constraining the
partitions to a set $\Phi$ of feasible partitions. We are searching for a partition
$\pi^* \in \Phi$ such that the corresponding BM $\mu^*$ minimizes the criterion function
$P(\mu)$.


3 BM of Temporal Networks 


For an early attempt of temporal network BM see [2, 5]. To the traditional BM
scheme we add the time dimension. We assume that the network is described using
temporal quantities [2] for nodes/links activity/presence, and some nodes properties
and links weights. Then also the BM partition $\pi$ is described for each node $v$ with a
temporal quantity $\pi(v, t)$: $\pi(v, t) = i$ means that in time $t$ node $v$ belongs to
cluster $i$. The structure and activity of clusters $C_i(t) = \{v : \pi(v, t) = i\}$ can
change through time, but they preserve their identity.

For the BM $\mu$, the clusters are mapped into BM nodes $\mu : C_i \to [i]$. To determine
the BM we still have to specify how the links from $C_i$ to $C_j$ are represented in the
BM. In general, for the model arc $([i], [j])$, we have to specify two TQs: its weight
$a_{ij}$ and, in the case of generalized BM, its type $t_{ij}$. The weight can be an object
of a different type than the weights of the block links in the original temporal network.

We assume that in a temporal network
$\mathcal{N} = (\mathcal{V}, \mathcal{L}, \mathcal{T}, \mathcal{P}, \mathcal{W})$ the links
weight is described by a TQ $w \in \mathcal{W}$. In the following years we intend to
develop BM methods case by case.


1. constant partition – nodes stay in the same cluster all the time:

   a. indirect approach based on clustering of TQs: $p(v) = \sum_{u \in N(v)} w(v, u)$,
      hierarchical clustering and leaders;

   b. indirect approach by conversion to the clustering with relational constraints
      (CRC);

   c. direct approach by (local) optimization of the criterion function $P$ over $\Phi$.

2. dynamic partition – nodes can move between clusters through time. The details
   are still to be elaborated.


In this paper, we present approaches for cases 1.a and 1.b. 
In the literature there exist other approaches to BM of temporal networks. A 
recent overview is available in the book [12]. 


3.1 Adapted Symbolic Clustering Methods 


In [8] we adapted traditional leaders [13, 10] and agglomerative hierarchical [14, 1] 
clustering methods for clustering of modal-valued symbolic data. They can be almost 
directly applied for clustering units described by variables that have for their values 
temporal quantities. 

For a unit $X_i$, each variable $V_j$ is described with a size $h_{ij}$ and a temporal
quantity $x_{ij}$, $X_{ij} = (h_{ij}, x_{ij})$. In our algorithms we use normalized values
of temporal variables $V' = (h, p)$ where

$$p = [(s_r, f_r, p_r) : r = 1, 2, \ldots, k] \quad \text{and} \quad p_r = \frac{v_r}{h}$$

In the case, when h = total(x), the normalized TQ p is essentially a probability 
distribution. 
Both methods create cluster representatives that are represented in the same way. 
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3.2 Clustering of Temporal Network and CRC 


To use the CRC in the construction of a nodes partition we have to define a
dissimilarity measure $d(u, v)$ (or a similarity $s(u, v)$) between nodes. An obvious
solution is $s(u, v) = f(w(u, v))$, for example

1. Total activity: $s_1(u, v) = \mathrm{total}(w(u, v))$
2. Average activity: $s_2(u, v) = \mathrm{average}(w(u, v))$
3. Maximal activity: $s_3(u, v) = \max(w(u, v))$

We can transform a similarity $s(u, v)$ into a dissimilarity by $d(u, v) = \frac{1}{s(u, v)}$
or $d(u, v) = S - s(u, v)$ where $S > \max_{u,v} s(u, v)$. In this way, we transformed the
temporal network partitioning problem into a clustering with relational constraints
problem [6, 360–369]. It can be efficiently solved also for large sparse networks.
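
A possible way to compute these similarities and turn them into dissimilarities is
sketched below. The representation of link weights as lists of (s, f, v) triples, the
function names, and the choice $S = \max s + 1$ are assumptions made for the example;
any $S$ larger than the maximal similarity would do.

```python
def total(a):
    return sum((f - s) * v for s, f, v in a)

def average(a):
    return total(a) / sum(f - s for s, f, _ in a)

def maximum(a):
    return max(v for _, _, v in a)

SIMILARITY = {"total": total, "average": average, "max": maximum}

def dissimilarities(edge_tqs, kind="total", mode="complement"):
    """Map each link to d(u, v), computed from the TQ w(u, v) on that link."""
    sims = {edge: SIMILARITY[kind](a) for edge, a in edge_tqs.items()}
    if mode == "inverse":
        return {edge: 1.0 / s for edge, s in sims.items()}   # d = 1 / s
    big_s = max(sims.values()) + 1.0                          # some S > max s
    return {edge: big_s - s for edge, s in sims.items()}     # d = S - s

# Two links with illustrative temporal weights.
edge_tqs = {
    ("a", "b"): [(1, 3, 2), (5, 6, 4)],
    ("b", "c"): [(2, 4, 1)],
}
d_complement = dissimilarities(edge_tqs, "total", "complement")
d_inverse = dissimilarities(edge_tqs, "total", "inverse")
print(d_complement, d_inverse)
```

Only linked pairs of nodes receive a dissimilarity, which is what allows the relational
constraint to keep the problem sparse.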


3.3 Block Model 


Having the partition $\pi$, to produce a BM we have to specify the values on its links.
There are different options for model links weights $a(([i], [j]))$.

1. Temporal quantities: $a(([i], [j])) = \mathrm{activity}(C_i, C_j) = \sum_{u \in C_i, v \in C_j} w(u, v)$,
   for $i \neq j$, and $a(([i], [i])) = \frac{1}{2} \mathrm{activity}(C_i, C_i)$.
2. Total intensities: $a_t(([i], [j])) = \mathrm{total}(a(([i], [j])))$.
3. Geometric average intensities: $a_g(([i], [j])) = \dfrac{a_t(([i], [j]))}{\sqrt{|C_i| \cdot |C_j|}}$.
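
The total and geometric average intensities can be computed along the following lines.
The sketch assumes an undirected network in which every edge is stored once together
with the total of its temporal weight, so that the factor 1/2 on the diagonal is already
accounted for; the function name, data layout, and example values are hypothetical.

```python
from math import sqrt

def image_matrices(edge_totals, cluster_of):
    """Return cluster sizes, total intensities a_t and geometric averages a_g."""
    sizes = {}
    for c in cluster_of.values():
        sizes[c] = sizes.get(c, 0) + 1
    a_t = {}
    for (u, v), weight in edge_totals.items():
        # Undirected model: store each block once, with i <= j.
        i, j = sorted((cluster_of[u], cluster_of[v]))
        a_t[(i, j)] = a_t.get((i, j), 0.0) + weight
    a_g = {(i, j): t / sqrt(sizes[i] * sizes[j]) for (i, j), t in a_t.items()}
    return sizes, a_t, a_g

cluster_of = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
edge_totals = {("a", "b"): 4.0, ("a", "c"): 3.0, ("d", "b"): 1.0, ("c", "d"): 6.0}
sizes, a_t, a_g = image_matrices(edge_totals, cluster_of)
print(sizes, a_t, a_g)
```

Dividing by the geometric mean of the two cluster sizes makes blocks between clusters of
very different sizes comparable.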


4 Example: September 11th Reuters Terror News 


The Reuters Terror News network was obtained from the CRA (Centering Resonance 
Analysis) networks produced by Steve Corman and Kevin Dooley at Arizona State 
University. The network is based on all the stories released during 66 consecutive 
days by the news agency Reuters concerning the September 11 attack on the U.S., 
beginning at 9:00 AM EST 9/11/01. 

The nodes, $n = 13332$, of this network are important words (terms). For a given
day, there is an edge between two words iff they appear in the same utterance (for
details see the paper [9]). The network has $m = 243447$ edges. The weight of an
edge is its daily frequency. There are no loops in the network. The network Terror
News is undirected, and so is its BM.

The Reuters Terror News network was used as a case network for the Viszards
visualization session at the Sunbelt XXII International Sunbelt Social Network
Conference, New Orleans, USA, 13-17 February 2002. It is available at
http://vlado.fmf.uni-lj.si/pub/networks/data/CRA/terror.htm.
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We transformed the Pajek version of the network into NetsJSON format used in 
libraries TQ and Nets. For a temporal description of each node/word for clustering 
we took its activity (sum of all TQs on edges adjacent to a given node v) 


$$\mathrm{act}(v) = \sum_{u \in N(v)} w(v : u).$$


Our leaders' and hierarchical clustering methods are compatible — they are based 
on the same clustering error criterion function. Usually, the leaders’ method is used 
to reduce a large clustering problem to up to some hundred units. With hierarchical 
clustering of the leaders of the obtained clusters, we afterward determine the "right" 
number of clusters and their representatives. 
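
The first stage of this two-stage scheme can be illustrated with a generic leaders'
(k-means-like) relocation procedure. The sketch is a simplification under stated
assumptions: units are plain numeric vectors (for instance the p-components of their
normalized activities), the error is the squared Euclidean distance, and leaders are
cluster means, whereas the adapted method of [8] uses dissimilarities designed for
modal-valued symbolic data. The function names and data are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def leaders_method(units, initial_leaders, max_iter=100):
    """Alternate between assigning units to the nearest leader and updating leaders."""
    leaders = [list(leader) for leader in initial_leaders]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(leaders)), key=lambda k: sq_dist(x, leaders[k]))
            for x in units
        ]
        if new_assignment == assignment:
            break                      # no unit changed its cluster
        assignment = new_assignment
        for k in range(len(leaders)):
            members = [x for x, c in zip(units, assignment) if c == k]
            if members:                # keep the old leader of an empty cluster
                leaders[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, leaders

# Four units with two-day activity distributions (illustrative values).
units = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, leaders = leaders_method(units, [units[0], units[2]])
print(assignment, leaders)
```

The resulting leaders, weighted by their cluster sizes, would then be passed to the
agglomerative hierarchical stage.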
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Fig. 2 Hierarchical clustering of 100 leaders in Terror News. 
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To cluster all 13332 words (nodes) in Terror News we used the adapted leaders’ 
method searching for 100 clusters. We continued with the hierarchical clustering of 
the obtained 100 leaders. The result is presented in the dendrogram in Figure 2. 
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Fig. 3 Word clouds for clusters C58 and C81. 
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To get an insight into the content of a selected cluster we draw the corresponding 
word cloud based on the cluster's leader. In Figure 3 the word clouds for clusters 
C58 and C81 (|C58| = 1396, |C81| = 2226 ) are presented. 

We can also compare the activities of pairs of clusters by considering the overlap
of p-components (probability distributions) of their leaders. In Figure 4, we compare
cluster C58 with cluster C81, and cluster L96 with cluster C66. In the right
diagram some values are outside the display area: L96[15] = 0.3524, C66[4] =
0.1961, C66[5] = 0.2917.


[Figure 4 panels: C58:C81 (left) and L96:C66 (right)]


Fig. 4 Comparing activities of clusters (blue — first cluster, red — second cluster, violet — overlap). 


We decided to consider in the BM the clustering of Terror News into 5 clusters
$\mathcal{C} = \{C94, C88, C95, L43, L74\}$. The split of cluster C95 gives clusters of
sizes 325 and 629 (for sizes, see the right side of Figure 5). Both clusters C94 and C88
have a chaining pattern at their top levels.

Because of large differences in the cluster sizes, it is difficult to interpret the total
intensities image matrix. We get an overall insight into the BM structure from the
geometric average intensities image matrix (right side of Figure 5) and the
corresponding BM network (cut level 0.3; left side of Figure 5).


 i | cluster |      1      2      3      4      5
 1 | C94     | 123.85  12.23   2.26   1.57   1.42
 2 | C88     |          3.58   0.33   0.22   0.19
 3 | C95     |                 0.56   0.07   0.07
 4 | L43     |                        0.38   0.08
 5 | L74     |                               0.39
   | size    |   6018   5109    954    535    716


Fig. 5 Block model and image matrix. 
A more detailed BM is presented by the activities (p-components) image matrix
in Figure 6.
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Fig. 6 BM represented as p-components of temporal activities of links between pairs of clusters. 


A more compact representation of a temporal BM is a heatmap display of this 
matrix in Figure 7. Because of some relatively very large values, it turns out that the 
display of the matrix with logarithmic values provides much more information. 
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To the Terror News network, we applied also the clustering with relational con- 
straints approach. Because of the limited space available for each paper, we can not 
present it here. A description of the analysis with the corresponding code is available 
at https://github.com/bavla/TQ/wiki/BMRC.
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5 Conclusions 


The presented research is a work in progress. It only deals with the two simplest 
cases of temporal blockmodeling. We provided some answers to the problem of 
normalization of model weights TQs when comparing them and some ways to 
present/display the temporal BMs. 

We used different tools (R, Python, and Pajek) to obtain the results. We intend to 
provide the software support in a single tool — probably in Julia. We also intend to 
create a collection of interesting and well-documented temporal networks for testing 
and demonstrating the developed software. 
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Latent Block Regression Model 


Rafika Boutalbi, Lazhar Labiod, and Mohamed Nadif 


Abstract When dealing with high dimensional sparse data, such as in recommender
systems, co-clustering turns out to be more beneficial than one-sided clustering, even
if one is interested in clustering along one dimension only. Thereby, co-clusterwise is
a natural extension of clusterwise. Unfortunately, none of the existing approaches
considers covariates on both dimensions of a data matrix. In this paper, we propose
a Latent Block Regression Model (LBRM) overcoming this limitation. For inference,
we propose an algorithm that performs co-clustering and regression simultaneously,
where a linear regression model characterizes each block. Placing the estimation of the
model parameters under the maximum likelihood approach, we derive a Variational
Expectation-Maximization (VEM) algorithm for estimating the model's parameters.
The usefulness of the proposed VEM-LBRM is illustrated through simulated datasets.


Keywords: co-clustering, clusterwise, tensor, data mining 


1 Introduction 


The cluster-wise linear regression algorithm CLR (or Latent Regression Model) is
a finite mixture of regressions and one of the most commonly used methods for
simultaneous learning and clustering [14, 5]. It aims to find clusters of entities that
minimize the overall sum of squared errors from regressions performed over these
clusters. Specifically, $X = [x_{ij}] \in \mathbb{R}^{n \times d}$ is the covariate matrix
and $Y \in \mathbb{R}^{n \times 1}$ the response vector. The cluster-wise method aims to
find $g$ clusters $C_1, \ldots, C_g$ and regression coefficients
$\beta^{(k)} \in \mathbb{R}^{d \times 1}$ by minimizing the following objective function

$$\sum_{k=1}^{g} \sum_{i \in C_k} \left( y_i - \left( \sum_{j=1}^{d} \beta_j^{(k)} x_{ij} + b_k \right) \right)^2$$

where:
• $y_i$ is the value of the dependent variable for subject/observation $i$ defined by
  $x_i = (x_{i1}, \ldots, x_{id})$,
• $x_{ij}$ is the value of the $j$-th independent variable for subject/observation $i$,
• $\beta_j^{(k)}$ is the $j$-th multiple regression coefficient and $b_k$ is the intercept.
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Various adjustments have been made to this model to improve its performance in 
terms of clustering and prediction. In our contribution, we propose to embed the 
co-clustering in the model. 
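To make the clusterwise objective above concrete, the following Python sketch evaluates it for a given partition by fitting one least-squares regression with intercept per cluster. It is only an illustration under stated assumptions: the function name `clusterwise_sse` and the toy data are hypothetical and do not come from the paper, and the partition is taken as given rather than optimized.

```python
import numpy as np

def clusterwise_sse(X, y, labels, g):
    """Fit one least-squares regression (with intercept) per cluster
    and return the total sum of squared errors over all clusters."""
    total = 0.0
    for k in range(g):
        idx = labels == k
        if not idx.any():
            continue  # an empty cluster contributes nothing
        Xk = np.column_stack([np.ones(idx.sum()), X[idx]])  # intercept column b_k
        coef, *_ = np.linalg.lstsq(Xk, y[idx], rcond=None)
        total += float(((y[idx] - Xk @ coef) ** 2).sum())
    return total

# Toy data (hypothetical): two clusters generated by two different lines.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1))
labels = np.repeat([0, 1], 20)
y = np.where(labels == 0, 2.0 * X[:, 0] + 1.0, -3.0 * X[:, 0] + 4.0)

sse_true = clusterwise_sse(X, y, labels, g=2)                      # true partition
sse_single = clusterwise_sse(X, y, np.zeros(40, dtype=int), g=1)   # one global model
```

With noise-free data, the true partition gives an error close to zero, whereas a single global regression cannot fit both lines at once.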

Co-clustering is a simultaneous clustering of both dimensions of a data matrix
that has proven to be more beneficial than traditional one-sided clustering, especially
when dealing with sparse data. When dealing with high dimensional data, sparse or
not, co-clustering turns out to be more valuable than one-sided clustering [1, 13],
even if one is interested in clustering along one dimension only. In [4] the authors
proposed the SCOAL approach (Simultaneous Co-clustering and Learning model),
leading to co-clustering and prediction for binary data; they generalized the model
to continuous data. However, this model does not take into account the sparsity
of data in the sense that it does not lead to homogeneous blocks. The obtained
results in terms of Mean Square Error (MSE) are good, but in terms of co-clustering
(homogeneity of co-clusters), no analysis has been presented. This model is also
related to the soft PDLF (Predictive Discrete Latent Factor) model [2], where the
value of each response $y_{ij}$ in a co-cluster is modeled as a sum
$\beta^\top x_{ij} + \delta_{k\ell}$, where $\beta$ is a global regression model and
$\delta_{k\ell}$ is a co-cluster specific offset. More
recently, in [17] the authors proposed an algorithm that takes into account only row
covariate information to realize co-clustering and regression simultaneously. To
this end, the authors rely on the latent block model [8]. In our contribution,
we also rely on this model, but we consider both row and column covariates.

The proposed Latent Block Regression Model (LBRM) is an extension of fi- 
nite mixtures of regression models where the co-clustering is embedded. It allows 
us to deal with co-clustering and regression simultaneously while taking into ac- 
count covariates. To estimate the parameters we rely on a Variational Expectation- 
Maximization algorithm [7] referred to as VEM-LBRM. 


2 From Clusterwise Regression to Co-clusterwise Regression 
2.1 Latent Block Model (LBM) 


Consider an $n \times d$ data matrix $X = (x_{ij}, i \in I = \{1, \ldots, n\}; j \in J = \{1, \ldots, d\})$. It is
assumed that there exists a partition on $I$ and a partition on $J$. A partition of $I \times J$ into
$g \times m$ blocks will be represented by a pair of partitions $(z, w)$. The $k$-th row cluster
corresponds to the set of rows $i$ such that $z_{ik} = 1$ and $z_{ik'} = 0$ $\forall k' \neq k$. Thereby, the
partition represented by $z$ can also be represented by an $n \times g$ matrix of elements in
$\{0, 1\}$ satisfying $\sum_{k=1}^{g} z_{ik} = 1$. Similarly, the $\ell$-th column cluster corresponds to the
set of columns $j$ such that $w_{j\ell} = 1$, and the partition $w$ can be represented by a
$d \times m$ matrix of elements in $\{0, 1\}$ satisfying $\sum_{\ell=1}^{m} w_{j\ell} = 1$.

Considering the Latent Block Model (LBM) [6], it is assumed that each element
$x_{ij}$ of the $k\ell$-th block is generated according to a parameterized probability
density function (pdf) $f(x_{ij}; \alpha_{k\ell})$. Furthermore, in the LBM the univariate
random variables $x_{ij}$ are assumed to be conditionally independent given $(z, w)$.
Thereby, the conditional pdf of $X$ can be expressed as
$P(z_{ik} = 1, w_{j\ell} = 1 | X) = P(z_{ik} = 1 | X) P(w_{j\ell} = 1 | X)$. From this hypothesis, we then
consider the latent block model where the two sets $I$ and $J$ are considered as random
samples and the row and column labels become latent variables. Therefore, the parameter
of the latent block model is $\Theta = (\pi, \rho, \alpha)$, with $\pi = (\pi_1, \ldots, \pi_g)$ and
$\rho = (\rho_1, \ldots, \rho_m)$, where $(\pi_k = P(z_{ik} = 1), k = 1, \ldots, g)$ and
$(\rho_\ell = P(w_{j\ell} = 1), \ell = 1, \ldots, m)$ are the mixing proportions, and
$\alpha = (\alpha_{k\ell}; k = 1, \ldots, g, \ell = 1, \ldots, m)$, where $\alpha_{k\ell}$
is the parameter of the distribution of block $k\ell$. Considering that the complete
data are the vector $(X, z, w)$, i.e., we assume that the latent variables $z$ and $w$
are known, the resulting complete data log-likelihood of the latent block model
$L_C(X, z, w, \Theta) = \log f(X, z, w; \Theta)$ can be written as follows


$$\sum_{i,k} z_{ik} \log \pi_k + \sum_{j,\ell} w_{j\ell} \log \rho_\ell + \sum_{i=1}^{n} \sum_{j=1}^{d} \sum_{k=1}^{g} \sum_{\ell=1}^{m} z_{ik} w_{j\ell} \log \Phi_{k\ell}(x_{ij}; \alpha_{k\ell}),$$

where the $\pi_k$'s and $\rho_\ell$'s denote the proportions of row and column clusters
respectively, and $\Phi_{k\ell}(x_{ij}; \alpha_{k\ell})$ is the pdf of block $k\ell$; see for instance [8]. Note
that the complete-data log-likelihood breaks
into three terms: the first one depends on proportions of row clusters, the second on
proportions of column clusters and the third on the pdf of each block or co-cluster.
The objective is then to maximize the function $L_C(z, w, \Theta)$.


2.2 Latent Block Regression Model (LBRM) 


For co-clustering of continuous data, the Gaussian latent block model can be used. For
instance, it is easy to show that the minimization of the well-known criterion
$\|X - z \mu w^\top\|^2 = \sum_{k=1}^{g} \sum_{\ell=1}^{m} \sum_{i | z_{ik}=1} \sum_{j | w_{j\ell}=1} (x_{ij} - \mu_{k\ell})^2$, where $z \in \{0, 1\}^{n \times g}$,
$w \in \{0, 1\}^{d \times m}$ and $\mu \in \mathbb{R}^{g \times m}$, is associated with the latent block Gaussian model with
$\alpha_{k\ell} = (\mu_{k\ell}, \sigma_{k\ell}^2)$, where the proportions of row clusters and column clusters are equal and,
in addition, the variances of blocks are identical [9]. Note that 1) the characteristic
of the latent block model is that the rows and the columns are treated symmetrically, and
2) the estimation of the parameters requires a variational approximation [7, 17]. In
the sequel, we see how we can integrate a regression model. Hereafter, we propose a
novel Latent Block Regression Model for co-clustering and learning simultaneously.
The model considers the response matrix $Y = [y_{ij}] \in \mathbb{R}^{n \times d}$ and the covariate tensor
$X = [x_{ij}] \in \mathbb{R}^{n \times d \times v}$, where $n$ is the number of rows, $d$ the number of columns, and
$v$ the number of covariates. Figure 1 presents the data structure for the proposed model
LBRM.

In the following we propose the integration of a mixture of regressions [5] per block
in the Latent Block Model (LBM), considering the distribution $\Phi(y_{ij} | x_{ij}; \lambda_{k\ell})$. We
assume in the following the normality of $\Phi$:

$$\Phi(y_{ij} | x_{ij}; \lambda_{k\ell}) = p(y_{ij} | x_{ij}; \beta_{k\ell}, \sigma_{k\ell}) = (2\pi\sigma_{k\ell}^2)^{-0.5} \exp\Big\{ -\frac{1}{2\sigma_{k\ell}^2} (y_{ij} - \beta_{k\ell}^\top x_{ij})^2 \Big\}.$$


Fig. 1 Data representation for proposed model.


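With the LBRM model, the parameter $\Omega$ is composed of the row and column proportions
$\pi$ and $\rho$ respectively, $\beta = \{\beta_{11}, \ldots, \beta_{gm}\}$ with
$\beta_{k\ell}^\top = (\beta_{k\ell}^0, \beta_{k\ell}^1, \ldots, \beta_{k\ell}^v)$ where $\beta_{k\ell}^0$
represents the intercept of the regression, and $\sigma = \{\sigma_{11}, \ldots, \sigma_{gm}\}$. The classification
log-likelihood can be written:

$$L_C(z, w, \Omega) = \sum_{i,k} z_{ik} \log \pi_k + \sum_{j,\ell} w_{j\ell} \log \rho_\ell - \frac{1}{2} \sum_{k,\ell} z_{.k} w_{.\ell} \log(\sigma_{k\ell}^2) - \sum_{k,\ell} \frac{1}{2\sigma_{k\ell}^2} \sum_{i,j} z_{ik} w_{j\ell} (y_{ij} - \beta_{k\ell}^\top x_{ij})^2$$

with $z_{.k} = \sum_i z_{ik}$ and $w_{.\ell} = \sum_j w_{j\ell}$.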
3 Variational EM Algorithm 


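To estimate $\Omega$, the EM algorithm [3] is a candidate for this task. It maximizes
the log-likelihood $f(X, \Omega)$ w.r.t. $\Omega$ iteratively by maximizing the conditional
expectation of the complete data log-likelihood $L_C(z, w; \Omega)$ w.r.t. $\Omega$, given a
previous current estimate $\Omega^{(c)}$ and the observed data $X$. Unfortunately, difficulties
arise owing to the dependence structure among the variables $x_{ij}$ of the model. To solve
this problem, an approximation using the interpretation of the EM algorithm given in [12]
can be proposed; see, e.g., [7, 8]. Hence, the aim is to maximize the following lower bound
of the log-likelihood criterion: $F_C(\tilde{z}, \tilde{w}; \Omega) = L_C(\tilde{z}, \tilde{w}, \Omega) + H(\tilde{z}) + H(\tilde{w})$, where
$H(\tilde{z}) = -\sum_{i,k} \tilde{z}_{ik} \log \tilde{z}_{ik}$ with $\tilde{z}_{ik} = P(z_{ik} = 1 | X)$,
$H(\tilde{w}) = -\sum_{j,\ell} \tilde{w}_{j\ell} \log \tilde{w}_{j\ell}$ with $\tilde{w}_{j\ell} = P(w_{j\ell} = 1 | X)$,
and $L_C(\tilde{z}, \tilde{w}; \Omega)$ is the fuzzy complete data log-likelihood (up
to a constant). $L_C(\tilde{z}, \tilde{w}; \Omega)$ is given by

$$L_C(\tilde{z}, \tilde{w}, \Omega) = \sum_{i,k} \tilde{z}_{ik} \log \pi_k + \sum_{j,\ell} \tilde{w}_{j\ell} \log \rho_\ell - \frac{1}{2} \sum_{k,\ell} \tilde{z}_{.k} \tilde{w}_{.\ell} \log(\sigma_{k\ell}^2) - \sum_{k,\ell} \frac{1}{2\sigma_{k\ell}^2} \sum_{i,j} \tilde{z}_{ik} \tilde{w}_{j\ell} (y_{ij} - \beta_{k\ell}^\top x_{ij})^2$$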


The maximization of $F_C(\tilde{z}, \tilde{w}, \Omega)$ can be reached by carrying out the three following
optimizations: update $\tilde{z}$ by $\arg\max_{\tilde{z}} F_C(\tilde{z}, \tilde{w}, \Omega)$, update $\tilde{w}$ by
$\arg\max_{\tilde{w}} F_C(\tilde{z}, \tilde{w}, \Omega)$, and update $\Omega$ by $\arg\max_{\Omega} F_C(\tilde{z}, \tilde{w}, \Omega)$.
In what follows, we detail the Expectation
(E) and Maximization (M) steps of the Variational EM algorithm for tensor data.

E-step. It consists in computing, for all $i, k, j, \ell$, the posterior probabilities $\tilde{z}_{ik}$
and $\tilde{w}_{j\ell}$ maximizing $F_C(\tilde{z}, \tilde{w}, \Omega)$ given the estimated parameters $\Omega^{(c)}$. It is easy
to show that the posterior probability $\tilde{z}_{ik}$ maximizing $F_C(\tilde{z}, \tilde{w}, \Omega)$ is given by:
$\tilde{z}_{ik} \propto \pi_k \exp\big( \sum_{j,\ell} \tilde{w}_{j\ell} \log p(y_{ij} | x_{ij}, \beta_{k\ell}, \sigma_{k\ell}) \big)$. In the same manner, the
posterior probability $\tilde{w}_{j\ell}$ is given by:
$\tilde{w}_{j\ell} \propto \rho_\ell \exp\big( \sum_{i,k} \tilde{z}_{ik} \log p(y_{ij} | x_{ij}, \beta_{k\ell}, \sigma_{k\ell}) \big)$.

M-step. Given the previously computed posterior probabilities $\tilde{z}$ and $\tilde{w}$, the M-step
consists in updating, $\forall k, \ell$, the parameters of the model $\pi_k$, $\rho_\ell$ and $\lambda_{k\ell}$
maximizing $F_C(\tilde{z}, \tilde{w}, \Omega)$. Using the computed quantities from step E, the maximization
step (M-step) involves the following closed-form updates.


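* Taking into account the constraints $\sum_k \pi_k = 1$ and $\sum_\ell \rho_\ell = 1$, it is easy to show
  that $\pi_k = \frac{\sum_i \tilde{z}_{ik}}{n} = \frac{\tilde{z}_{.k}}{n}$ and $\rho_\ell = \frac{\sum_j \tilde{w}_{j\ell}}{d} = \frac{\tilde{w}_{.\ell}}{d}$.
* The update of $\lambda_{k\ell}$, which is formed by $(\beta_{k\ell}, \sigma_{k\ell})$, can be obtained by simple
  derivatives of $F_C(\tilde{z}, \tilde{w}, \Omega)$ with respect to $\beta_{k\ell}$ and $\sigma_{k\ell}$ respectively. This leads to

$$\beta_{k\ell} = \Big[ \sum_{i,j} \tilde{z}_{ik} \tilde{w}_{j\ell} x_{ij} x_{ij}^\top \Big]^{-1} \sum_{i,j} \tilde{z}_{ik} \tilde{w}_{j\ell} y_{ij} x_{ij}, \qquad \sigma_{k\ell}^2 = \frac{\sum_{i,j} \tilde{z}_{ik} \tilde{w}_{j\ell} (y_{ij} - \beta_{k\ell}^\top x_{ij})^2}{\sum_{i,j} \tilde{z}_{ik} \tilde{w}_{j\ell}}.$$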


The proposed algorithm for tensor data, referred to as VEM-LBRM, alternates the two
previously described Expectation and Maximization steps. At convergence, a hard
co-clustering is deduced from the posterior probabilities.
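The alternation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `vem_lbrm`, the random Dirichlet initialization of the posteriors, the fixed number of iterations, the small ridge term `eps` and the choice to start each iteration with the M-step are all assumptions made here. The intercept is handled by prepending a constant 1 to each covariate vector.

```python
import numpy as np

def _softmax_rows(a):
    a = a - a.max(axis=1, keepdims=True)   # stabilise before exponentiating
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def vem_lbrm(Y, X, g, m, n_iter=30, seed=0, eps=1e-10):
    """Illustrative VEM loop. Y: (n, d) responses, X: (n, d, v) covariates."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    Xa = np.concatenate([np.ones((n, d, 1)), X], axis=2)  # leading 1 = intercept
    p = Xa.shape[2]
    Z = rng.dirichlet(np.ones(g), size=n)   # soft row memberships (assumed init)
    W = rng.dirichlet(np.ones(m), size=d)   # soft column memberships (assumed init)
    for _ in range(n_iter):
        # M-step: proportions, then weighted least squares and variance per block
        pi, rho = Z.mean(axis=0), W.mean(axis=0)
        beta = np.zeros((g, m, p))
        sigma2 = np.ones((g, m))
        logp = np.zeros((n, d, g, m))
        for k in range(g):
            for l in range(m):
                wt = np.outer(Z[:, k], W[:, l])                    # z_ik * w_jl
                A = np.einsum('ij,ijp,ijq->pq', wt, Xa, Xa) + eps * np.eye(p)
                b = np.einsum('ij,ij,ijp->p', wt, Y, Xa)
                beta[k, l] = np.linalg.solve(A, b)
                res = Y - Xa @ beta[k, l]
                sigma2[k, l] = max((wt * res**2).sum() / (wt.sum() + eps), eps)
                logp[:, :, k, l] = (-0.5 * np.log(2 * np.pi * sigma2[k, l])
                                    - res**2 / (2 * sigma2[k, l]))
        # E-step: update row posteriors, then column posteriors
        Z = _softmax_rows(np.log(pi + eps) + np.einsum('jl,ijkl->ik', W, logp))
        W = _softmax_rows(np.log(rho + eps) + np.einsum('ik,ijkl->jl', Z, logp))
    # hard co-clustering deduced from the posterior probabilities
    return Z.argmax(axis=1), W.argmax(axis=1), beta, sigma2
```

Like any EM-type procedure, such a loop may converge to a local optimum, so in practice several random initializations would typically be compared.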


4 Experimental Results 


First, we evaluate the proposed VEM-LBRM on three synthetic datasets in terms of
co-clustering and regression. We compare VEM-LBRM with some clustering and regression
methods, namely Global model, which is a single multiple linear regression
model performed on all observations, K-means, Clusterwise, Co-clustering
and SCOAL. We retain two widely used measures to assess the quality of clustering,
namely the Normalized Mutual Information (NMI) [16] and the Adjusted Rand Index
(ARI) [15]. Intuitively, NMI quantifies how much the estimated clustering is
informative about the true clustering. The ARI metric is related to the clustering
accuracy and measures the degree of agreement between an estimated clustering and
a reference clustering. Both NMI and ARI are equal to 1 if the resulting clustering
is identical to the true one. On the other hand, we use the RMSE (Root MSE) and MAE
(Mean Absolute Error) metrics to evaluate the precision of prediction: RMSE
is a loss function suitable for Gaussian noise, whereas MAE uses the absolute
value, which is less sensitive to extreme values.
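As an illustration of these measures, the sketch below computes the ARI in its standard pair-counting form from two label vectors, together with RMSE and MAE, using only the Python standard library. The function names are hypothetical and the code is not taken from the paper; it assumes at least two observations.

```python
from collections import Counter
from math import comb, sqrt

def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from the contingency table of two labelings."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:   # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def rmse(y_true, y_pred):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# ARI ignores label names: swapping 0 and 1 still gives perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])

# One large error inflates RMSE more than MAE.
errors_rmse = rmse([0, 0, 0, 0], [1, 1, 1, 9])
errors_mae = mae([0, 0, 0, 0], [1, 1, 1, 9])
```

In the second example, the single large error pulls the RMSE above the MAE, which illustrates why MAE is described as less sensitive to extreme values.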

We generated tensor data X with size 200 x 200 x 2 according to a Gaussian
model per block. In the simulation study, we considered three scenarios by varying
the regression parameters: the examples have blocks with different regression
collinearity and different co-cluster structure complexity. The parameters for each
example are reported in Table 1. Figures 2 and 3 depict the true regression
planes and the true simulated response matrix Y.

Table 1 Parameters generation for examples.


Dataset | Example 1 | Example 2 | Example 3
$\pi = [0.35, 0.35, 0.3]$, $\rho = [0.55, 0.45]$
o 0-3 oz] o=7 
10 2 0.3 12 
* z- 5 ae j| z- bi 
Co-clusters   | $\beta_{k\ell}$  $\mu_{k\ell}$  | $\beta_{k\ell}$  $\mu_{k\ell}$  | $\beta_{k\ell}$  $\mu_{k\ell}$


Cluster (1,1) | [1, -10, 1]    [5,20]  | [1, -10, 1]  [5,20]  | [1, -10, 1]  [5,20]
Cluster (1,2) | [10, 4, 13]    [5,10]  | [1, -10, 1]  [5,10]  | [1, -10, 1]  [5,10]
Cluster (2,1) | [3, 20, -2]    [10,20] | [1, -10, 1]  [10,20] | [1, -10, 1]  [5,30]
Cluster (2,2) | [-5, 2, -6]    [10,10] | [7, 5, -10]  [10,10] | [7, 5, -10]  [20,10]
Cluster (3,1) | [-10, 20, 10]  [20,20] | [7, 5, -10]  [20,20] | [7, 5, -10]  [20,20]
Cluster (3,2) | [7, 5, -10]    [20,10] | [7, 5, -10]  [20,10] | [7, 5, -10]  [20,30]
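The simulation scheme can be sketched as follows, using the proportions and the Example 1 regression coefficients and means listed in Table 1. Several points are assumptions made here for illustration and are not stated in the text: the function name `simulate_lbrm`, the reading of $\mu_{k\ell}$ as the block-wise mean of the covariates, the unit variance of the covariates, and the noise level `sigma=1.0`.

```python
import numpy as np

def simulate_lbrm(n, d, pi, rho, beta, mu, sigma, seed=0):
    """Draw labels, covariates and responses block by block (illustrative)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)   # shape (g, m, v + 1), intercept first
    mu = np.asarray(mu, dtype=float)       # shape (g, m, v)
    z = rng.choice(len(pi), size=n, p=pi)      # row labels
    w = rng.choice(len(rho), size=d, p=rho)    # column labels
    B = beta[z[:, None], w[None, :]]           # (n, d, v + 1) coefficients per cell
    M = mu[z[:, None], w[None, :]]             # (n, d, v) covariate means per cell
    X = M + rng.normal(size=M.shape)           # unit-variance covariates (assumed)
    Y = B[..., 0] + (B[..., 1:] * X).sum(axis=2) + sigma * rng.normal(size=(n, d))
    return X, Y, z, w

pi, rho = [0.35, 0.35, 0.3], [0.55, 0.45]
beta = [[[1, -10, 1], [10, 4, 13]],        # Example 1 coefficients from Table 1
        [[3, 20, -2], [-5, 2, -6]],
        [[-10, 20, 10], [7, 5, -10]]]
mu = [[[5, 20], [5, 10]], [[10, 20], [10, 10]], [[20, 20], [20, 10]]]
X, Y, z, w = simulate_lbrm(200, 200, pi, rho, beta, mu, sigma=1.0)
```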


Fig. 3 Synthetic data: True co-clustering according to the chosen parameters. Panels: (a) Example 1, (b) Example 2, (c) Example 3.


In our illustrations, we consider co-clustering and regression challenges. All
metrics concerning rows and columns are computed by averaging over ten random
training/testing splits, using 80% of the data for training and 20% for validation.
We compare VEM-LBRM with Global model (which is a multiple linear
regression), K-means and Clusterwise by reshaping the tensor into a matrix of size
N x v where N = n x d. On the other hand, the VEM algorithm for co-clustering is
applied on the response matrix Y. Furthermore, for clustering algorithms, the RMSE,
MAE, and R-squared are computed by applying linear regression on each obtained
co-cluster. Table 2 reports the performances for all algorithms. The missing
values represent measures that cannot be computed by the corresponding models.
From these comparisons, we observe that VEM-LBRM outperforms the other algorithms
whether or not the block structure is easy to identify.

Table 2 (co)-clustering and prediction: mean and sd in parentheses.


          |                     | Regression                                        | Clustering
Examples  | Algorithms          | RMSE (Training Test) | MAE (Training Test) | Rsquare | ARI | NMI
Example 1 | Global model        | 164.38 164.05     | 145.29 145.05   |        |               |
          |                     | (0.49)            | (0.08) (0.71)   |        |               |
          | K-means             | 49.2 49.51        | 34.86 34.91     | 0.8    | 0.61 -        | 0.49 -
          |                     | (60.2) (67.48)    | (33.56) (35.79) | (0.02) | 02. -         | 0.03 -
          | Clusterwise (g=3)   | 154.57 154.47     | 127.77 127.93   | 0.52   | 0.07 -        | 0.0 -
          |                     | (0.01) (0.36)     | (0.03) (0.45)   | (0.0)  | 0.0 -         | 0.0 -
          | Co-clustering (g=3) | 10.86 10.83       | 7.29 7.29       | 0.88   | 0.84 1.0      | 0.71 1.0
          |                     | (14.76) (14.36)   | (4.67) (4.59)   | (0.0)  | 0.01 0.0      | 0.04 0.0
          | SCOAL (g=3, m=2)    | 14.99 14.92       | 10.45 10.41     | 0.99   | 0.91 1.0      | 0.84 1.0
          |                     | (207.56) (208.91) | (89.48) (90.55) | (0.0)  | 0.01 0.0      | 0.04 0.0
          | VEM-LBRM (g=3, m=2) | 7.1 7.06          | 5.29 5.26       | 0.99   | 0.95 1.0      | 0.92 1.0
          |                     | (17.71) (0686)    | (6.8) (6.32)    | (0.0)  | (0.01) (0.0)  | (0.03) (0.0)
Example 2 | Global model        | 29.15 29.21       | 24.64 24.68     | 0.34   | - -           | - -
          |                     | (0.04) (0.15)     | (0.04) (0.12)   | (0.00) | - -           | - -
          | K-means             | 10.43 10.49       | 7.73 7.71       | 0.71   | 0.56 -        | 0.45 -
          |                     | (0.25) (0.24)     | (0.17) (0.16)   | (0.01) | 0.0 -         | 0.0 -
          | Clusterwise (g=3)   | 18.54 18.62       | 11.33 11.38     | 0.73   | 0.15 -        | 0.16 -
          |                     | (0.09) (0.27)     | (0.06) (0.14)   | (0.0)  | 0.0 -         | 0.0 -
          | Co-clustering (g=3) | 25 7.49           | 5.89 5.9        | 0.8    | 0.95 1.0      | 0.94 1.0
          |                     | (1.35) (1.38)     | (0.82) (0.86)   | (0.07) | 0.14 0.0      | 0.17 0.0
          | SCOAL (g=3, m=2)    | 12.63 12.69       | 8.75 8.81       | 0.81   | 0.97 1.0      | 0.54 1.0
          |                     | (12.57) (12.81)   | (7.38) (7.58)   | (0.35) | 0.1 0.0       | 0.17 0.0
          | VEM-LBRM (g=3, m=2) | 6.99 6.99         | 5.57 5.57       | 0.96   | 1.0 1.0       | 1.0 1.0
          |                     | (0.01) (06.04)    | (0.01) (0.02)   | (0.0)  | (0.0) (0.0)   | (0.0) (0.0)
Example 3 | Global model        | 45.38 45.24       | 38.33 38.21     | 0.49   | - -           | - -
          |                     | (0.06) (0.24)     | (0.07) (0.26)   | (0.00) | - -           | - -
          | K-means             | 10.47 10.41       | 7.44 7.42       | 0.83   | 0.54 -        | 0.45 -
          |                     | (1.73) (1.74)     | (1.08) (1.08)   | (0.08) | 0.01 -        | 0.01 -
          | Clusterwise (g=3)   | 23.09 23.18       | 12.09 12.15     | 0.87   | 0.09 -        | 0.09 -
          |                     | (1.84) (2.02)     | (1.23) (1.29)   | (0.02) | 0.0 -         | 0.0 -
          | Co-clustering (g=3) | 9.48 9.39         | 6.98 6.93       | 0.73   | 0.74 1.0      | 0.7 1.0
          |                     | (0.16) (0.22)     | (0.01) (0.02)   | (0.02) | 0.04 0.0      | 0.08 0.0
          | SCOAL (g=3, m=2)    | 27.32 27.14       | 16.82 16.73     | 0.57   | 0.98 1.0      | 0396 1.0
          |                     | (41.97) (41.83)   | (24.13) (24.16) | (0.93) | 0.07 0.0      | 0.12 0.0
          | VEM-LBRM (g=3, m=2) | 7.21 7.21         | 5.71 5.71       | 0.99   | 0.98 1.0      | 0.96 1.0
          |                     | (0.68) (0.7)      | (0.42) (0.42)   | (0.0)  | (0.07) (0.0)  | (0.12) (0.0)

To go further, note that in [11], the authors reformulated the clusterwise approach and
introduced the linear cluster-weighted model (CWM) in a statistical setting and
showed that it is a general and flexible family of mixture models. They included in
the classical clusterwise model the probability $\Phi'(x_i | \Omega_k)$ to model the covariates,
whereas the classical clusterwise approach models the output only, using
$\Phi(y_i | x_i; \lambda_k)$. They
prove that sufficient conditions for model identifiability are provided under a suitable
assumption of Gaussian covariates [10]. We can include in LBRM a joint probability
$\Phi'(x_{ij} | \Omega_{k\ell})$ where $\Omega_{k\ell} = [\mu_{k\ell}, \Sigma_{k\ell}]$ to evaluate its impact in terms of clustering
and regression. Figure 4 presents the graphical model of LBRM and its extension.
The first experiments on real datasets give encouraging results.


Fig. 4 Graphical model of LBRM (left) and its extension (right).

5 Conclusion


Inspired by the flexibility of the latent block model (LBM), we proposed extending it
to tensor data aiming at both tasks: co-clustering and prediction. This model (LBRM)
gives rise to a variational EM algorithm for co-clustering and prediction referred to
as VEM-LBRM. This algorithm, which can be viewed as a co-clusterwise algorithm,
can easily deal with sparse data. Empirical results on synthetic data showed that
VEM-LBRM gives more encouraging results for clustering and regression than
some algorithms that are devoted to one or both tasks. For future
work, we plan to develop the extension of LBRM and apply the proposed models to
the recommender system task.

Using Clustering and Machine Learning
Methods to Provide Intelligent Grocery
Shopping Recommendations


Nail Chabane, Mohamed Achraf Bouaoune, Reda Amir Sofiane Tighilt, Bogdan 
Mazoure, Nadia Tahiri, and Vladimir Makarenkov 


Abstract Nowadays, grocery lists are part of the shopping habits of many customers.
With the popularity of e-commerce and the plethora of products and promotions
available in online stores, it can become increasingly difficult for customers to identify
products that both satisfy their needs and represent the best deals overall. In this
paper, we present a grocery recommender system based on traditional
machine learning methods, which aims to assist customers with the creation of their
grocery lists on the MyGroceryTour platform, which displays weekly grocery deals in
Canada. Our recommender system relies on the individual user purchase histories,
as well as the available products' and stores' features, to constitute intelligent weekly
grocery lists. The use of clustering prior to supervised machine learning methods
allowed us to identify customer profiles and reduce the choice of potential products
of interest for each customer, thus improving the prediction results. The highest
average F-score of 0.499 for the considered dataset of 826 Canadian customers was
obtained using the Random Forest prediction model, which was compared to the
Decision Tree, Gradient Boosting Tree, XGBoost, Logistic Regression, CatBoost,
Support Vector Machine and Naive Bayes models in our study.


Keywords: clustering, dimensionality reduction, grocery shopping recommendation,
intelligent shopping list, machine learning, recommender systems

1 Introduction


Grocery shopping is a common activity that involves different factors such as budget 
and impulse purchasing pressure [1]. Customers typically rely on a mental or digital 
list to facilitate their grocery trips. Many of them show a favorable interest towards 
tools and applications that help them manage their grocery lists, while keeping 
them updated with special offers, coupons and promotions [2, 3]. Major retailers 
throughout the world typically offer discounts on different products every week in 
order to improve sales and attract new customers. This very common practice leads 
to the fact that thousands of items go on special simultaneously across different 
retailers at a given week. The resulting information overload often makes it difficult 
for customers to quickly identify the deals that best suit their needs, which can become 
a source of frustration [4]. To address this problem, many grocery stores have taken 
advantage of the popularity of e-commerce to set up their own websites featuring 
various functionalities, including Recommender Systems, to assist customers during 
the shopping process. 

Recommender Systems (RSs) [5] are tools and techniques that offer personalized
suggestions to users based on several parameters (e.g. their past behavior). RSs have
recently become a field of interest for researchers and retailers as many e-commerce
sites, online book stores and streaming platforms have started to offer this service on
their websites (e.g. Amazon, Netflix and Spotify). Here, we recall some recent works in
this field. Faggioli et al. [6] used the popular Collaborative Filtering (CF) approach
to predict the customer's next basket in the context of grocery shopping, taking into
account the recency parameter. When comparing their model with the CF baseline
models, Faggioli et al. observed a consistent improvement of their prediction results.
Che et al. [7] used attention-based recurrent neural networks to capture both inter-
and intra-basket relationships, thus modelling users' long-term preferences and dynamic
short-term decisions.

Content-based recommendation has also proven efficient in the literature, as
demonstrated by Xia et al. [8], who proposed a tree-based model for coupon
recommendation. By processing their data with undersampling methods, the authors were
able to increase the estimated click rate from 1.20% to 7.80% as well as to improve
significantly the F-score results using the Random Forest classifier and the recall results
using XGBoost. Dou [9] presented a statistical model to predict whether or not a user
will buy an item using Yandex's CatBoost method [10]. Dou relied on contextual
and temporal features as well as on some session features, such as the time of
visit of specific web pages, to demonstrate the efficiency of CatBoost in this context.
Finally, Tahiri et al. [11] used recurrent and feedforward neural networks (RNNs and
FFNs) in combination with non-negative matrix factorization and gradient boosting
trees to create intelligent weekly grocery baskets to be recommended to the users
of MyGroceryTour. Tahiri et al. considered features characterizing the users of
MyGroceryTour that differ from those used in our study, with the best
F-score result of 0.37 obtained from the augmented dataset.

2 Materials and Methods


2.1 Data Considered 


In this section we describe the dataset obtained from the MyGroceryTour website and used
in our research. MyGroceryTour [11] is a Canadian grocery shopping website and
database available in both English and French. The main purpose of the
website is to present weekly specials offered by the major grocery retailers in Canada.
It allows users to display grocery products available in their living area, compare
products across different stores and build their grocery shopping baskets
based on the provided insights. MyGroceryTour users can easily archive and manage
their grocery lists and access them at any given time.

In this study, we considered 826 MyGroceryTour users with varying numbers of
grocery baskets (between 3 and 100 baskets were available per user). The grocery
baskets contained different products added by users when they were creating their
weekly shopping lists. In our recommender system (i.e. the current basket prediction
experiment), we have considered the following features:


* user id : unique user identifier (numerical) 

* basket id : unique basket identifier (numerical) 

* product id : unique product identifier (numerical) 

* category : category of the product (categorical) 

* price : price of the product (numerical) 

* special : discount on the product (in %) compared to regular price (numerical)

* distance min : minimal distance between user's home and the closest store where
the product was available (numerical)

* distance mean : mean distance between user's home and all stores where the 
product was available (numerical) 

* availability : availability of the product at different stores (binary) 


In addition, we engineered the total bought feature which represents, for each 
product, the total number of times it has been bought over all users. 


2.2 Data Normalization 


Data normalization is an important data preprocessing step in both unsupervised and 
supervised machine learning [12] as well as in data mining [13]. Prior to feeding the 
data to our models we rescaled the available features using z-score standardization. 
Thus, all rescaled features had a mean of 0 and a standard deviation of 1:


$$z(x_f) = \frac{x_f - \mu_f}{\sigma_f} \tag{1}$$


where $x_f$ is the original value of the observation at feature $f$, $\mu_f$ is the mean and
$\sigma_f$ is the standard deviation of $f$.

2.3 Further Data Preprocessing Steps


In order to determine which weekly products could be recommended to a given 
user we propose to classify them using both clustering (unsupervised learning) 
and traditional supervised machine learning methods. The final recommendation is 
obtained based on the availability of the products, the data on the products' regular 
prices and available discounts, as well as on the user's shopping history. In our 
context, the baskets contain only the products bought by the users. The information 
about the other available products (not selected by the user at the moment he/she 
organized his/her shopping basket) is also available on MyGroceryTour. It has been 
used to create a large class of available items that were not bought by the user. 

While we considered the items bought by a given user as positive feedback, we
regarded the items that were available to this user at the time of the order, but not
acquired by him/her, as negative feedback. For an order of size P, if T is the total
number of items available to the user at the time of the order, the negative feedback
N for that order is N = T - P. In this context, N usually represents thousands of
products, while P is typically below 50. This difference in size between positive
and negative feedback leads to imbalanced training data and could
result in a significant loss in performance. Similarly to Xia et al. [8], we applied an
undersampling method to balance our data instead of considering all of the available
disregarded items as negative feedback.
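A simple random undersampling of the negative feedback for one order could look like the sketch below. The text does not specify which undersampling method or class ratio was used, so the function name `undersample_negatives`, the uniform random sampling and the default of one negative per positive are assumptions made here for illustration.

```python
import random

def undersample_negatives(bought, available, ratio=1.0, seed=0):
    """Return (positives, sampled_negatives) for one order.

    `ratio` is the number of negatives kept per positive; 1.0 is an assumed
    setting, not a value reported in the paper.
    """
    positives = sorted(set(bought))
    negatives = sorted(set(available) - set(bought))   # the N = T - P items
    k = min(len(negatives), round(ratio * len(positives)))
    rng = random.Random(seed)
    return positives, rng.sample(negatives, k)

available = list(range(1000))        # T = 1000 hypothetical product ids
bought = [3, 17, 256, 512, 999]      # P = 5 products in the basket
pos, neg = undersample_negatives(bought, available)
```

With these toy values, the 995 unselected products are reduced to 5 sampled negatives, so the two classes have the same size for this order.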

To identify customer profiles and perform a preselection of products that are
likely to be of interest to a given user, we first carried out the clustering of the
normalized original dataset (the K-means [14] and DBSCAN [15] data partitioning
algorithms were used). Then, we limited the choice of the items offered to a given user
to the products purchased by the members of his/her cluster. By doing so, we managed
to reduce the number of products which could be recommended to the user and thus
to minimize potential classification mistakes. The clustering phase is detailed in
Subsection 2.4. Then traditional machine learning methods were used to provide
the final weekly recommendation. The size S of the weekly basket recommended
to a given user was equal to the mean size of his/her previous shopping baskets.
As the number of items to be recommended by the machine learning methods
was often greater than S, we retained as final recommendation the top S items,
ranked according to the confidence score (i.e. the probability estimate for a given
observation, computed using the predict_proba function from the scikit-learn [16]
library).
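The top-S selection described above can be illustrated with the short sketch below. The function name `top_s_items`, the toy items and scores, and the rounding of the mean basket size to the nearest integer are assumptions made here; the text does not say how a non-integer mean is handled. In practice the scores would be the positive-class column returned by a fitted scikit-learn classifier's `predict_proba`.

```python
def top_s_items(item_ids, scores, previous_basket_sizes):
    """Keep the S highest-scoring items, where S is the (rounded) mean size
    of the user's previous baskets."""
    s = round(sum(previous_basket_sizes) / len(previous_basket_sizes))
    ranked = sorted(zip(item_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:s]]

# Hypothetical candidate items with their confidence scores.
items = ["milk", "bread", "eggs", "tea", "rice"]
proba = [0.91, 0.40, 0.75, 0.62, 0.10]
basket = top_s_items(items, proba, previous_basket_sizes=[2, 3, 4])
```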


2.4 Data Clustering 


In this section, we present the steps we carried out to obtain the clusters of users. As
explanatory features used to generate clusters, we considered the mean prices and
mean specials of the products purchased by the user as well as a new feature, called
here the fidelity ratio $FR_u$, which is meant to give insight on whether a given user $u$
has a favorite store where he/she makes most of his/her grocery purchases. $FR_u$ is
defined as follows:

$$FR_u = \frac{x_{max,u} - \frac{1}{n-1} \sum_{i=1}^{n-1} x_{i,u}}{x_{total,u}} \tag{2}$$

where $x_{max,u}$ is the total number of products bought by user $u$ at the store where
he/she made most of his/her purchases, $n$ ($n > 1$) is the total number of stores visited by
user $u$, $x_{i,u}$ is the number of products bought by user $u$ at the $i$-th of the other
$n - 1$ stores, and $x_{total,u}$ ($x_{total,u} = x_{max,u} + \sum_{i=1}^{n-1} x_{i,u}$) is the total number of products
purchased by user $u$ over all stores he/she visited. A high fidelity ratio means that
user $u$ buys most of his/her products at the same store, whereas a low fidelity ratio
indicates that user $u$ buys his/her products at different stores. When user $u$ purchases
all of his/her products at the same store ($x_{max,u} = x_{total,u}$ and $n = 1$), the fidelity
ratio equals 1. It equals 0 when he/she purchases the same number of products at
different stores.
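The fidelity ratio can be computed directly from the per-store purchase counts of a user, as in the following sketch. The function name `fidelity_ratio` and the list-based input format are hypothetical; the single-store case returns 1, as stated in the text.

```python
def fidelity_ratio(purchases_per_store):
    """purchases_per_store: number of products bought by one user at each
    store he/she visited (one entry per store)."""
    counts = sorted(purchases_per_store, reverse=True)
    if len(counts) == 1:
        return 1.0                       # single store: ratio defined as 1
    x_max, others = counts[0], counts[1:]
    mean_others = sum(others) / len(others)      # (1 / (n - 1)) * sum of x_i
    return (x_max - mean_others) / sum(counts)   # divided by x_total

fr_loyal = fidelity_ratio([10])          # all purchases at one store
fr_even = fidelity_ratio([5, 5, 5])      # same number at every store
fr_mixed = fidelity_ratio([8, 1, 1])     # one dominant store
```

For the third user, the ratio is (8 - 1) / 10 = 0.7, which lies between the two extreme cases.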

The K-means [14] and DBSCAN [15] algorithms were used to perform clustering.
Here we present the results of DBSCAN, as the clusters provided by DBSCAN
had less entity overlap than those provided by K-means. The main advantage of
DBSCAN is that this density-based algorithm is able to capture clusters of any
shape.

Fig. 1 Davies-Bouldin cluster validity index variation with respect to the number of clusters.


We used the Davies-Bouldin (DB) [17] cluster validity index to determine the
number of clusters in our dataset. The Davies-Bouldin index is the average similarity
between each cluster $C_i$, for $i = 1, \ldots, k$, and its most similar counterpart $C_j$. It is
calculated as follows:


$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij}, \tag{3}$$

where $R_{ij}$ is the similarity measure between clusters calculated as $(d_i + d_j)/\delta_{ij}$,
where $d_i$ ($d_j$) is the mean distance between objects of cluster $C_i$ ($C_j$) and the
cluster centroid and $\delta_{ij}$ is the distance between the centroids of clusters $C_i$ and $C_j$.
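Equation (3) translates directly into the NumPy sketch below. The function name `davies_bouldin` and the toy data are hypothetical, and Euclidean distance is assumed. scikit-learn also provides `sklearn.metrics.davies_bouldin_score`; the explicit version is shown only to mirror the formula. When the labels come from DBSCAN, noise points (label -1) would presumably have to be removed first, which is not handled here.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Average, over clusters, of the worst-case similarity R_ij."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # d_i: mean distance between the objects of cluster i and its centroid
    d = np.array([np.linalg.norm(X[labels == c] - centroids[a], axis=1).mean()
                  for a, c in enumerate(ids)])
    k = len(ids)
    worst = []
    for i in range(k):
        r = [(d[i] + d[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(k) if j != i]
        worst.append(max(r))
    return float(np.mean(worst))

# Toy data: two compact, well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
db_good = davies_bouldin(X, np.array([0, 0, 1, 1]))   # groups kept together
db_bad = davies_bouldin(X, np.array([0, 1, 0, 1]))    # groups mixed up
```

Lower values indicate more compact and better separated clusters, which is why the partition that keeps the two groups together obtains the smaller index.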

Figure 1 illustrates the variation of the Davies-Bouldin cluster validity index 
whose lowest (i.e. best) value was reached for our dataset with 6 clusters. The 
resulting clusters are represented in Figure 2. After performing the data clustering, 
we applied the t-SNE [18] dimensionality reduction method for data visualisation 
purposes. Since t-SNE is known to preserve the local structure of the data but not 
the global one, we used the PCA initialization parameter to mitigate this issue. 


Fig. 2 Clustering results: Clustering obtained with DBSCAN with the best number of clusters
according to the Davies-Bouldin index. Data reduction was performed using t-SNE. The 6 clusters
of customers found by DBSCAN are represented by different symbols.


We have noticed that the users in Cluster 1 (see Fig. 2) are fairly sensitive to 
specials and have a high fidelity score, the users in Cluster 2 mostly purchase 
products on special in different stores, the users in Cluster 3 seem to be sensitive 
to the total price of their shopping baskets, Cluster 4 includes the users who are 
sensitive to specials but have a low fidelity score, Cluster 5 includes the users who 
are not very attracted by specials but are rather loyal to their favorite store(s), and 
the users in Cluster 6 tend to buy products on special and have high fidelity scores. 


3 Application of Supervised Machine Learning Methods 


To predict the products to be recommended for the current weekly basket, we used
the following supervised machine learning methods: Decision Tree, Random Forest,
Gradient Boosting Tree, XGBoost, Logistic Regression, CatBoost, Support Vector
Machine and Naive Bayes. These methods were used through their scikit-learn
implementations [16]. Due to the lack of large datasets, we did not use deep learning
models in our study. We decided to use these classical machine learning methods
because they are usually recommended for smaller datasets, contrary to
their deep learning counterparts. Also, deep learning algorithms usually do not
properly handle the mixed types of features present in our data. Most of the methods
we used are ensemble methods, i.e. they use multiple replicates to reduce the variance.
The F-score results provided by each method without clustering (using all products
available) and with clustering (using only the products purchased by the cluster
members) are presented in Table 1.
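To make the evaluation metric concrete, here is a minimal sketch of a per-basket F-score averaged over users. It assumes the F-score is the F1 measure computed between the set of recommended products and the set of products actually purchased; the function names and product names are hypothetical.

```python
def basket_f_score(predicted, actual):
    """F1-score between a recommended basket and the basket actually bought."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f_score(baskets):
    """Average the per-user F1-scores over (predicted, actual) pairs."""
    scores = [basket_f_score(p, a) for p, a in baskets]
    return sum(scores) / len(scores)


# Hypothetical product names, for illustration only.
recommended = {"milk", "bread", "eggs", "apples"}
purchased = {"milk", "bread", "butter"}
score = basket_f_score(recommended, purchased)  # precision 2/4, recall 2/3
average = mean_f_score([(recommended, purchased), ({"tea"}, {"tea"})])
```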

As shown in Table 1, Random Forest outperformed the other competing methods 
without and with data clustering, providing the average F-scores of 0.494 and 0.499 
(obtained over all users), respectively. Tree-based models relying on gradient boost- 
ing performed relatively well and could possibly give better results with a different 
data processing. We can also notice that all the methods, except CatBoost, benefited 
from the data clustering process. 


Table 1 F-scores provided by ML methods without and with clustering of MyGrocery Tour users.


Machine learning method    Without clustering    With clustering
CatBoost                   0.438                 0.438
Decision Tree              0.463                 0.468
Gradient Boosting Tree     0.488                 0.495
Logistic Regression        0.474                 0.478
Naive Bayes                0.433                 0.436
Random Forest              0.494                 0.499
SVM-RBF                    0.392                 0.397
XGBoost                    0.476                 0.481


4 Conclusion 


In this paper, we presented a novel recommender system that is intended to predict 
the content of the customer's weekly basket depending on his/her purchase history. 
Our system is also able to predict the store(s) where the purchase(s) will take place. 
The clustering step allowed us to identify customer profiles and to improve the F- 
score result for every tested machine learning model, except CatBoost. Using our 
methodology and the new data available on MyGrocery Tour, we were able to improve 
the F-score performance by the margin of 0.129, compared to the results obtained 
by Tahiri et al. [11]. Our model is able to predict products that will be purchased 
again or acquired for the first time by a given user, but it is not yet able to predict the 
optimal quantity for each product to be bought. Another important issue is how to 
provide plausible recommendations for customers without shopping history (i.e. the 
cold start problem). We will tackle these important issues in our future work.


COVID-19 Pandemic: a Methodological Model
for the Analysis of Government’s Preventing
Measures and Health Data Records


Theodore Chadjipadelis and Sofia Magopoulou 


Abstract The study aims to investigate the associations between the government’s 
response measures during the COVID-19 pandemic and weekly incidence data (pos- 
itivity rate, mortality rate and testing rate) in Greece. The study focuses on the 
period from the detection of the first case in the country (26th February 2020) to the 
first week of 2022 (8th January 2022). Data analysis was based on Correspondence
Analysis on a fuzzy-coded contingency table, followed by Hierarchical Cluster Anal- 
ysis (HCA) on the factor scores. Results revealed distinct time periods during which 
interesting interactions took place between control measures and incidence data. 


Keywords: hierarchical cluster analysis, correspondence analysis, COVID-19, 
evidence-based policy making 


1 Introduction 


The present study focuses on the period of the COVID-19 pandemic in Greece, from 
the detection of the first case of COVID-19 to the first week of 2022. This period 
can be divided into five distinct phases. The first phase extends from the beginning 
of 2020 until the first lockdown, i.e., from the first case reported in Greece until 
the end of the first quarantine period in May 2020. The second phase concerns the 
interim period from June to October 2020, when the pandemic indices improved, 
and policies were loosened for the opening of tourism. The third phase concerns the 
second lockdown and the evolution of the pandemic in the country from November 
2020 to April 2021, when the first vaccination period of the adult population took 
place. The fourth phase includes the interim period from May 2021 to October
2021, when a general stabilization of the number of cases occurred, while the last
phase covers the significant increase in the number of cases from November 2021
to January 2022.

Overall, from March 2020 to January 2022, a total of 1.79 million COVID-19 cases
and 22,635 deaths were recorded in Greece (Figure 1). As of January 2022,
vaccination coverage exceeds 65% of the country's population, i.e., 7,241,468
fully vaccinated citizens.


[Figure 1 plot: cases from Day 1 (26/2/20), moving average over previous days, annotated with the events below.]

Day 127 (1/7/20): open
Day 141 (15/7/20): open UK flights
Day 147 (21/7/20): lock land borders
Day 155 (29/7/20): mandatory use of mask
Day 202 (14/9/20): in-person schools
Day 256 (7/11/20): lockdown
Day 346 (5/2/21): lockdown
Day 433 (3/5/21): unlock, open restaurants
Day 444 (14/5/21): unlock restrictive measures


Fig. 1 Record of cases of COVID-19 in Greece (March 2020-January 2022).


In this study, a combination of multivariate data analysis methods was employed 
to analyze COVID-19-related data so as to assess the quality of decision-making 
outputs during the crisis and improve evidence-based decision-making processes. 
Section 2 presents the methodology and describes the data sources and the data 
analysis workflow. Section 3 presents the study results, and Section 4 discusses the
results, proposes methodological tools and presents the conclusions of the paper.


2 Methodology 


2.1 Data 


For the study purposes, data were obtained from the Oxford Covid-19 Government
Response Tracker (OxCGRT) and were combined with self-collected Covid-19 data
for Greece [3], updated daily, in Greek. The Oxford Covid-19 Government Response
Tracker (OxCGRT) collects publicly available information reflecting government
responses from 180 countries since 1 January 2020 [4]. The tracker is based on data
for 23 indicators. In this study, two groups of indicators were considered in the case
of Greece: Containment & Closure and Health Systems. The first group of indicators
refers to “collective” level policies and measures, such as school closures and
restrictions on mobility, while the second refers to “individual” level policies and
measures, such as testing and vaccination. Specifically, the collective-level indicators
refer to government policies that affect society at a collective level: school
closing, workplace closing, cancellation of public events, restrictions on gatherings,
closure of public transport, stay-at-home requirements, internal movement
restrictions and international travel controls. The health system policies primarily
touch upon the individual level and specifically refer to: public information campaigns,
testing, contact tracing, healthcare facilities, investments in vaccines, facial coverings,
vaccination and protection of elderly people. All collective-level indicators (C1 to
C8) were summed to yield a total score (ranging from 0 to 16). Similarly,
individual-level indicators (H1 to H3 and H6 to H8) were summed to compute a total
score (ranging from 0 to 12).

The self-collected data refer to positive cases, number of Covid-19-related deaths, 
number of tests and total number of vaccinations administered. These data have been 
recorded daily since March 2020 from public announcements by official and verified 
sources. A total of 94 time points were considered in the present study, corresponding 
to weekly data (Monday was used as a reference). Three quantitative indicators were 
derived: a positivity index (#cases / #tests), a mortality index (#deaths / #cases) and a
testing index (#tests / #people). The number of vaccinations is not used in the present
study because the vaccination process began in January 2021 and the administration 
of the booster dose began in September 2021. The final data set consisted of five 
indicators: two ordinal total scores, and three quantitative indices. 
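The three quantitative indices can be sketched as follows. This is only an illustration of the ratios defined above: the record layout, the function name, the zero-division guards and the numbers are assumptions, not the authors' data or code.

```python
from dataclasses import dataclass


@dataclass
class WeeklyRecord:
    """Counts for one weekly time point (hypothetical layout)."""
    cases: int
    deaths: int
    tests: int


def weekly_indices(record, population):
    """Return (positivity, mortality, testing) indices for one week."""
    # Guard against empty denominators; this convention is an assumption.
    positivity = record.cases / record.tests if record.tests else 0.0
    mortality = record.deaths / record.cases if record.cases else 0.0
    testing = record.tests / population
    return positivity, mortality, testing


# Made-up numbers, for illustration only.
week = WeeklyRecord(cases=2000, deaths=40, tests=100_000)
positivity, mortality, testing = weekly_indices(week, population=10_000_000)
```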


2.2 Data Analysis 


A four-step data analysis strategy was adopted. In the first step, the three quantitative
variables (positivity rate, mortality rate and testing rate) were transformed into
ordinal variables via the method used in [7] (see Step 1), which transforms continuous
variables into ordinal categorical variables with minimum information loss. Three
ordinal variables were derived. In the second step, the five ordinal variables (i.e., the
three recoded variables and the two ordinal total scores) were fuzzy-coded into three
categories each, using the barycentric coding scheme proposed in [7]. This scheme
has recently been evaluated in the context of hierarchical clustering in [7] and was
applied with the DIAS Excel add-in [6]. Barycentric coding allows us to convert an
m-point ordinal variable into an n-point fuzzy-coded variable [6, 7]. In other words,
the transformation of the three quantitative variables into ordinal variables resulted
in a generalized 0-1 matrix (fuzzy-coded matrix), where for each variable we obtain
the estimated probability for each category. A drawback of the proposed approach is
that the ordinal information in the five ordinal variables is lost.
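To give an idea of what a fuzzy-coded row looks like, the sketch below codes a value into three categories (low, average, high) with triangular membership functions, a common fuzzy-coding choice in correspondence analysis. It is an assumption made for illustration: the barycentric scheme of [6, 7] may differ in its details, and the hinge points used here are hypothetical.

```python
def fuzzy_code(x, low, mid, high):
    """Fuzzy-code x into (low, average, high) memberships that sum to 1."""
    if x <= low:
        return (1.0, 0.0, 0.0)
    if x >= high:
        return (0.0, 0.0, 1.0)
    if x <= mid:
        w = (x - low) / (mid - low)
        return (1.0 - w, w, 0.0)
    w = (x - mid) / (high - mid)
    return (0.0, 1.0 - w, w)


# A collective-measures score of 12 on the 0-16 scale, with hinges 0, 8 and 16.
collective = fuzzy_code(12, low=0, mid=8, high=16)
# An individual-measures score of 3 on the 0-12 scale, with hinges 0, 6 and 12.
individual = fuzzy_code(3, low=0, mid=6, high=12)
# Concatenating the codes of all variables gives one row of the fuzzy-coded table.
row = collective + individual
```

Each variable contributes three columns whose values sum to 1, so a row built from five variables has fifteen columns summing to 5.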

The third step involved the application of Correspondence Analysis (CA) to
the fuzzy-coded table with the 94 weeks as rows and the fifteen fuzzy categories
as columns (see [1] for a similar approach). The number of significant axes was
determined based on the percentage of inertia explained, and the significant points on
each axis were determined based on the values of two statistics that accompany standard
CA output: quality of representation (COR) greater than 200 and contribution (CTR)
greater than 1000/(n + 1), where n is the total number of categories (i.e., 15 in our
case). In the final step, Hierarchical Cluster Analysis (HCA) using Benzécri's
chi-squared distance and Ward's linkage criterion [2, 8] was employed to cluster the
94 points (weeks) on the CA axes obtained from the previous step. The number of
clusters was determined using the empirical criterion of the change in the ratio of
between-cluster inertia to total inertia when moving from a partition with r clusters
to a partition with r − 1 clusters [8]. Lastly, we interpreted the clusters after determining
the contribution of each indicator to each cluster. All analyses were conducted with
the M.A.D. [Méthodes de l'Analyse des Données] software [5].
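The CA step and the two selection statistics can be sketched with a singular value decomposition. This is a generic textbook formulation written with NumPy, not the M.A.D. implementation; the function name and the small toy table are assumptions, and COR and CTR are expressed in thousandths to match the thresholds quoted above.

```python
import numpy as np


def correspondence_analysis(table):
    """Simple CA of a non-negative table via SVD of standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sigma ** 2 > 1e-12  # drop numerically null axes
    U, sigma, Vt = U[:, keep], sigma[keep], Vt[keep]
    inertia = sigma ** 2
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    # CTR: share of an axis' inertia due to a column, in thousandths.
    ctr = 1000 * c[:, None] * col_coords ** 2 / inertia
    # COR: share of a column's inertia captured by an axis, in thousandths.
    cor = 1000 * col_coords ** 2 / (col_coords ** 2).sum(axis=1, keepdims=True)
    return inertia, row_coords, col_coords, cor, ctr


# Toy 4 x 3 table, for illustration only.
table = [[20, 5, 5], [5, 15, 10], [5, 5, 20], [10, 10, 10]]
inertia, row_coords, col_coords, cor, ctr = correspondence_analysis(table)
percent_inertia = 100 * inertia / inertia.sum()
n_categories = col_coords.shape[0]
# Columns flagged on an axis: COR > 200 and CTR > 1000 / (n + 1).
significant = (cor > 200) & (ctr > 1000 / (n_categories + 1))
```

The row coordinates returned here are the factor scores on which a Ward clustering would subsequently be run.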


3 Results 


Correspondence Analysis resulted in four significant axes, which explain 74.91% of
the total inertia (Figure 2). For each axis, we describe the main contrast between
groups of categories based on their coordinates, COR and CTR values (Figure 3).
“Low and moderate mortality rates” and “high factor testing rates” define a pole on
the first axis, which is opposed to “average and high levels of ‘individual’ measures”.
On the second axis, “low positivity rate” and “average levels of collective measures”
define a pole, while “average and high positivity rate” and “high levels of collective
measures” define the opposite pole. The third axis is characterized by “moderate
and high mortality rate”, “high levels of collective measures” and “average levels
of individual measures”, which are opposed to “average levels of collective measures”.
On the fourth axis, “average levels of collective measures” are opposed to “average
testing rate” and “high levels of collective measures”.


Total inertia: 0.62704

Axis   Inertia      % Inertia   Cumulative %
1      0.1739028    27.73        27.73
2      0.1136495    18.12        45.86
3      0.1066425    17.01        62.87
4      0.0755233    12.04        74.91
5      0.0526040     8.39        83.30
6      0.0401367     6.40        89.70
7      0.0307749     4.91        94.61
8      0.0163050     2.60        97.21
9      0.0113139     1.80        99.01
10     0.0061758     0.98       100.00
11     0.0000054     0.00       100.00
12     0.0000036     0.00       100.00

Fig. 2 Explained inertia by axis.


Fig. 3 Category coordinates on the four CA axes (#G), quality of representation (COR) and
contribution (CTR). COR values greater than 200 and CTR values greater than 1000 / 16 = 62.5
are shown in yellow. Positive coordinates are shown in green and negative in pink.


Hierarchical Cluster Analysis on the factor scores resulted in seven clusters using
the empirical criterion for cluster determination (see Section 2.2). The corresponding
dendrogram is shown in Figure 4. The seven nodes in the figure that correspond to
the seven clusters are 182, 181, 175, 177, 171, 133 and 179. Cluster content
reflects the different periods (phases) presented in the introductory section.




Fig. 4 Dendrogram of the HCA.


The first cluster (182) combines data points from March 2020, the onset of the
pandemic, with data points from a period following the summer of 2020 (October
and November). This cluster is characterized by high positivity rate, low testing
rate, high levels of “collective” measures (containment & closure) and low levels of
“individual” measures (health system). The second cluster (181) contains data points
from April and May 2020 and is characterized by low positivity rate, average to high
mortality rate, low testing rate, high levels of “collective” measures (containment &
closure) and average levels of “individual” measures (health system). The third
cluster (175) combines summer months of 2020 and 2021. This cluster is characterized
by low positivity rate, low testing rate and average levels of “collective” measures
(containment & closure). The fourth cluster (177) marks the period of December
2020 and the period of spring of 2021, with average positivity rate and high levels of
“collective” measures (containment & closure). The fifth cluster (171) refers to the
period from December 2020 to February 2021, but also includes August 2021, with
high levels of “collective” measures (containment & closure). The sixth cluster (133)
refers to the period following the summer of 2021 (September and October 2021).
In this cluster, average positivity rates were observed but also strict containment and
closure measures.

Lastly, the seventh cluster (179) refers to November and December 2021, including 
also January 2022, with high positivity and high testing rates, while high levels of 
containment and closure and health system measures were observed. Figure 5 shows 
the contributions of each indicator in each cluster. 


Variable   Indicator               Level     Contribution values (non-empty cells, in cluster order 133, 171, 175, 177, 179, 181, 182)
VAR01      cases/tests             low       2.6888  3.6136
VAR02      cases/tests             average   8.884  8.884
VAR03      cases/tests             high      1.9892  13.556
VAR11      deaths/cases            low
VAR12      deaths/cases            average   10.5652
VAR13      deaths/cases            high      9.2727
VAR21      tests/people            low       1.9732  2.6465  2.3519
VAR22      tests/people            average
VAR23      tests/people            high      21.3527
VAR31      containment & closure   low
VAR32      containment & closure   average   6.8689
VAR33      containment & closure   high      3.8875  2.2049  1.7047  2.5037  2.3608  1.971
VAR41      health system           low       10.3651
VAR42      health system           average   12.7955
VAR43      health system           high      1.8392


Fig. 5 Cluster description (contribution values of the indicators in each cluster - node). 


4 Discussion 


Based on the study results, we can argue that, when it comes to measures and
real-time data following a situation such as the pandemic, the “chicken and egg”
dilemma arises. The question is whether “collective” and “individual” measures
affect daily incidence data or the inverse (i.e., that the daily data lead to measures).
We conclude that in fact the two should be perceived as working in conjunction
and not independently from one another. The analysis showed that lower positivity
rate is accompanied by average levels of measures from the government at both
the “individual” and the “collective” level. Furthermore, higher positivity rate is
accompanied by higher levels of measures, as a response. With regard to mortality
rate, we observed that higher mortality invokes higher levels of “collective” measures
and average levels of “individual” measures, whereas average levels of “collective”
measures are associated with higher mortality rate.

It is therefore evident that when it comes to decision making in crisis situations, a
systematic collection, analysis and use of data is linked to more effective government
response overall. Therefore, evidence-based policy making should be linked to crisis
management. This paper presents a first attempt to capture an ongoing phenomenon,
and it is therefore crucial that the collection and analysis of data continue
until the end of the phenomenon.


pcTVI: Parallel MDP Solver Using a
Decomposition into Independent Chains


Jaël Champagne Gareau, Éric Beaudry, and Vladimir Makarenkov


Abstract Markov Decision Processes (MDPs) are useful to solve real-world proba- 
bilistic planning problems. However, finding an optimal solution in an MDP can take 
an unreasonable amount of time when the number of states in the MDP is large. In 
this paper, we present a way to decompose an MDP into Strongly Connected Com- 
ponents (SCCs) and to find dependency chains for these SCCs. We then propose a 
variant of the Topological Value Iteration (TVI) algorithm, called parallel chained
TVI (pcTVI), which is able to solve independent chains of SCCs in parallel by
leveraging modern multicore computer architectures. The performance of our algorithm
was measured by comparing it to the baseline TVI algorithm on a new probabilistic 
planning domain introduced in this study. Our pcTVI algorithm led to a speedup 
factor of 20, compared to traditional TVI (on a computer having 32 cores). 


Keywords: Markov decision process, automated planning, strongly connected
components, dependency chains, parallel computing


1 Introduction 


Automated planning is a branch of Artificial Intelligence (AI) aiming at finding 
optimal plans to achieve goals. One example of problems studied in automated 
planning is the electric vehicle path-planning problem [1]. Planning problems with 
non-deterministic actions are known to be much harder to solve. Markov Decision
Processes (MDPs) are generally used to solve such problems leading to probabilistic
models of applicable actions [2].

In probabilistic planning, a solution is generally a policy, i.e., a mapping specifying 
which action should be executed in each observed state to achieve an objective. 
Usually, dynamic programming algorithms such as Value Iteration (VI) are used to 
find an optimal policy [3]. Since VI is time-expensive, many improvements have 
been proposed to find an optimal policy faster, using for example the Topological 
Value Iteration (TVI) algorithm [4]. However, very large domains often remain out 
of reach. One unexplored way to reduce the computation time of TVI is by taking 
advantage of the parallel architecture of modern computers and by decomposing an 
MDP into independent parts which could be solved concurrently. 

In this paper, we show that state-of-the-art MDP planners such as TVI can run 
an order of magnitude faster when considering task-level parallelism of modern 
computers. Our main contributions are as follows: 


• An improved version of the TVI algorithm, parallel-chained TVI (pcTVI), which
decomposes MDPs into independent chains of strongly connected components
and solves them concurrently.

• A new parametric planning domain, chained-MDP, and an evaluation of pcTVI's
performance on many instances of this domain compared to the VI, LRTDP [5]
and TVI algorithms.


2 Related Work 


Many MDP solvers are based on the Value Iteration (VI) algorithm [3], or more 
precisely on asynchronous variants of VI. In asynchronous VI, MDP states can be 
backed up in any order and do not need to be considered the same number of times. 
One way to take advantage of this is by assigning a priority to every state and by 
considering them in priority order. 

Several state-of-the-art MDP algorithms have been proposed to increase the speed 
of computation. Many of them are able to focus on the most promising parts of MDP 
through heuristic search algorithms such as LRTDP [5] or LAO* [6]. Some other 
MDP algorithms use partitioning methods to decompose the state-space in smaller 
parts. For example, the P3VI (Partitioned, Prioritized, Parallel Value Iteration) al- 
gorithm partitions the state-space, uses a priority metric to order the partitions in an 
approximate best solving order, and solves them in parallel [7]. The biggest disad- 
vantage of P3VI is that the partitioning is done on a case-by-case basis depending on 
the planning domain, i.e., P3VI does not include a general state-space decomposition
method. The inter-process communication between the solving threads also incurs 
an overhead on the computation time. The more recent TVI (Topological Value Iter- 
ation) algorithm [4] also decomposes the state-space, but does it by considering the 
topological structure of the underlying graph of the MDP, making it more general 
than P3VI. Unfortunately, to the best of our knowledge, no parallel version of TVI 
has been proposed in the literature.


3 Problem Definition


There exist different types of MDP, including Finite-Horizon MDP, Infinite-Horizon 
MDP and Stochastic Shortest Path MDP (SSP-MDP) [2]. The first two of them can 
be viewed as special cases of SSP-MDP [8]. In this work, we focus on SSP-MDPs, 
which we describe formally in Definition 1 below. 


Definition 1 A Stochastic Shortest Path MDP (SSP-MDP) is given by a tuple
(S, A, T, C, G), where:


• S is a finite set of states;

• A is a finite set of actions;

• T: S × A × S → [0, 1] is a transition function, where T(s, a, s′) is the probability
of reaching state s′ when applying action a while in state s;

• C: S × A → R⁺ is a cost function, where C(s, a) gives the cost of applying the
action a while in state s;

• G ⊆ S is the set of goal states (which can be assumed to be sink states).


We generally search for a policy π: S → A that tells us which action should be
executed at each state, such that an execution following the actions given by π until
a goal is reached has a minimal expected cost. This expected cost is given by a value
function V^π: S → R. The Bellman Optimality Equations are a system of equations
satisfied by any optimal policy.


Definition 2 The Bellman Optimality Equations are the following:


\[
V(s) =
\begin{cases}
0, & \text{if } s \in G, \\
\min_{a \in A} \Big[ C(s, a) + \sum_{s' \in S} T(s, a, s')\, V(s') \Big], & \text{otherwise.}
\end{cases}
\]


The expression between square brackets is called the Q-value of a state-action pair:


\[
Q(s, a) = C(s, a) + \sum_{s' \in S} T(s, a, s')\, V(s').
\]


When an optimal value function V* has been computed, an optimal policy π*
can be found greedily:


\[
\pi^*(s) = \operatorname{argmin}_{a \in A} Q^*(s, a).
\]


Most MDP solvers are based on dynamic programming algorithms like Value
Iteration (VI), which iteratively update an arbitrarily initialized value function until
convergence with a given precision ε. In the worst case, VI needs to do |S| sweeps of
the state space, where one sweep consists in updating the value estimate of every state
using the Bellman Optimality Equations. Hence, the number of state updates (called
backups) is O(|S|²). When the MDP is acyclic, most of these backups are wasteful,
since the MDP can in this situation be solved using only |S| backups (ordered in
reverse topological order), thus allowing one to find an optimal policy in O(|S|) [8].
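The Bellman backup, the sweep-until-convergence loop and the greedy policy extraction can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the dictionary-based MDP encoding, the function names and the toy instance are assumptions, and values are updated in place in round-robin order.

```python
def q_value(mdp, V, s, a):
    """Q(s, a) = C(s, a) + sum over s' of T(s, a, s') * V(s')."""
    cost, outcomes = mdp[s][a]
    return cost + sum(p * V[t] for p, t in outcomes)


def value_iteration(mdp, goals, epsilon=1e-9):
    """Sweep over all states until the largest update falls below epsilon."""
    V = {s: 0.0 for s in mdp}
    while True:
        residual = 0.0
        for s in mdp:
            if s in goals:
                continue  # goal states keep the value 0
            best = min(q_value(mdp, V, s, a) for a in mdp[s])
            residual = max(residual, abs(best - V[s]))
            V[s] = best  # one backup
        if residual < epsilon:
            return V


def greedy_policy(mdp, goals, V):
    """Pick, in every non-goal state, an action with the smallest Q-value."""
    return {
        s: min(mdp[s], key=lambda a: q_value(mdp, V, s, a))
        for s in mdp
        if s not in goals
    }


# Toy SSP-MDP: mdp[state][action] = (cost, [(probability, successor), ...]).
toy_mdp = {
    "s1": {"go": (1.0, [(1.0, "s0")])},
    "s0": {
        "safe": (3.0, [(1.0, "g")]),
        "risky": (1.0, [(0.5, "g"), (0.5, "s0")]),
    },
    "g": {},
}
toy_goals = {"g"}
V = value_iteration(toy_mdp, toy_goals)
policy = greedy_policy(toy_mdp, toy_goals, V)
```

In this toy instance the risky action has an expected cost of 2 (it solves V = 1 + 0.5 V), which is below the cost 3 of the safe action, so the greedy policy selects it.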


4 Parallel-chained TVI


In this section, we describe an improvement to the TVI algorithm, named pcTVI
(Parallel-Chained Topological Value Iteration), which is able to solve an MDP in
parallel (as in P3VI). pcTVI uses the decomposition proposed by TVI, known to give
good performance on many planning domains. We start by summarizing how the
original TVI algorithm works.

First, TVI uses Kosaraju's graph algorithm on a given MDP to find the strongly
connected components (SCCs) of its graphical structure (the graph corresponding
to its all-outcomes determinization). The SCCs are found by Kosaraju's algorithm
in reverse topological order, which means that for every i < j, there is no path
from a state in the i-th SCC to a state in the j-th SCC. This property ensures that
every SCC can be solved separately by VI sweeps if previous SCCs (according to
the reverse topological order) have already been solved. The second step of TVI is
thus to solve every SCC one by one in that order. Since TVI divides the MDP into
multiple subparts, it maximizes the usefulness of every state backup by ensuring
that only useful information (i.e., converged state values) is propagated through the
state-space.
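The two steps of TVI summarized above can be sketched as follows. Note that this sketch finds the SCCs with Tarjan's algorithm (the one adopted by pcTVI) in place of Kosaraju's; Tarjan's algorithm emits the components in reverse topological order. The MDP encoding, the function names and the toy instance are assumptions made for illustration, and the recursive search is only suitable for small graphs.

```python
def tarjan_sccs(graph):
    """SCCs of graph (dict: node -> successors), in reverse topological order."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def solve_scc(mdp, goals, V, scc, epsilon=1e-9):
    """Run VI sweeps restricted to one SCC; values outside it are final."""
    while True:
        residual = 0.0
        for s in scc:
            if s in goals:
                continue
            best = min(
                cost + sum(p * V[t] for p, t in outcomes)
                for cost, outcomes in mdp[s].values()
            )
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual < epsilon:
            return


def topological_value_iteration(mdp, goals):
    """Solve the SCCs one by one, in reverse topological order."""
    graph = {
        s: {t for _, outcomes in mdp[s].values() for _, t in outcomes}
        for s in mdp
    }
    V = {s: 0.0 for s in mdp}
    for scc in tarjan_sccs(graph):
        solve_scc(mdp, goals, V, scc)
    return V


# Toy SSP-MDP: mdp[state][action] = (cost, [(probability, successor), ...]).
toy_mdp = {
    "c": {"go": (1.0, [(1.0, "a")])},
    "a": {"to_goal": (5.0, [(1.0, "g")]), "to_b": (1.0, [(1.0, "b")])},
    "b": {"back": (1.0, [(0.5, "a"), (0.5, "g")])},
    "g": {},
}
toy_graph = {
    s: {t for _, outs in toy_mdp[s].values() for _, t in outs} for s in toy_mdp
}
order = tarjan_sccs(toy_graph)
V = topological_value_iteration(toy_mdp, {"g"})
```

Here the goal forms the first component, the cycle between a and b the second, and c the last one, so the value of c is computed with a single backup once the cycle has converged.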

Unfortunately, TVI can only solve one SCC at a time. Since modern computers
have many computing units (cores) which can work in parallel, we could theoretically
solve many SCCs in parallel to greatly reduce computation time. Instead of choosing
SCCs to solve in parallel arbitrarily or using a priority metric (as in P3VI), which
incurs a computational overhead to propagate the values between the threads, we
want to consider their topological order (as in TVI) to minimize redundant or useless
computations. One way to share the work between the processes is to find independent
chains of SCCs which can be solved in parallel. The advantage of independent chains
is that no coordination and communication is needed between the SCCs, which both
removes some running-time overhead and simplifies the implementation.

The Parallel-Chained TVI algorithm we propose (Algorithm 1) works as follows.
First, we find the graph G corresponding to the graphical structure of the MDP,
decompose it into SCCs, and find the reverse topological order of the SCCs (as in
TVI, but we use Tarjan’s algorithm instead of Kosaraju's algorithm since it is about
twice as fast). We then build the condensation of the graph G, i.e., the graph G_c
whose vertices are SCCs of G, where an edge is present between two vertices scc_1
and scc_2 if there exists an edge in G between a state s_1 ∈ scc_1 and a state s_2 ∈ scc_2.
We also store the reversed edges in G_c and a counter c_scc on every vertex scc which
indicates how many incoming neighbors have not yet been computed. We use this
(usually small) graph G_c to detect which SCCs are ready to be considered (the SCCs
whose incoming neighbors have all been determined with precision ε, i.e., the SCCs
whose associated counter c_scc is 0). When a new SCC is ready, it is inserted into a
work queue from which the waiting threads acquire their next task.


Algorithm 1 Parallel-Chained Topological Value Iteration


procedure pcTVI(M: MDP, t: Number of threads)
    ▷ Find the SCCs of M
    G ← Graph(M)                      ▷ G implicitly shares the same data structures as M
    SCCs ← Tarjan(G)                  ▷ SCCs are found in reverse topological order

    ▷ Build the graph of SCCs of G
    G_c ← GraphCondensation(G, SCCs)

    ▷ Solve in parallel independent SCCs
    Pool ← CreateThreadPool(t)        ▷ Create t threads
    V ← NewValueFunction()            ▷ Arbitrarily initialized; shared by all threads
    Q ← CreateQueue()                 ▷ Shared by all threads
    Insert(Q, Head(SCCs))             ▷ The goal SCC is inserted in the queue
    while NotEmpty(Q) do              ▷ Only one thread runs this loop
        scc ← ExtractNextItem(Q)
        for all neighbor ∈ Neighbors(scc) do
            Decrement NumIncomingNeighbors(neighbor)
            if NumIncomingNeighbors(neighbor) = 0 then
                AssignTaskToAvailableThread(Pool, PartialVI(M, V, scc))
                Push(Q, scc)          ▷ Neighbors of scc are ready to be considered next
            end if
        end for
    end while

    ▷ Compute and return an optimal policy using the computed value function
    Π ← GreedyPolicy(V)
    return Π
end procedure
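The dependency-counter scheduling at the heart of Algorithm 1 can be sketched as follows. This is a simplified illustration and not the authors' C++ implementation: the function name, the dictionary describing the condensation graph and the solve callback are hypothetical, and, unlike the pseudocode, the coordinating loop here explicitly waits for an SCC to finish before releasing the SCCs that depend on it. In CPython, threads do not speed up CPU-bound work, so the sketch only shows the coordination logic.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def solve_in_dependency_order(dependents, solve, num_threads=4):
    """Run solve(scc) on every SCC, never before the SCCs it depends on.

    dependents[u] lists the SCCs whose values depend on SCC u
    (the reversed edges of the condensation graph).
    """
    # Counter per SCC: number of dependencies that are not solved yet.
    pending = {u: 0 for u in dependents}
    for u in dependents:
        for v in dependents[u]:
            pending[v] += 1

    finished = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # SCCs without unsolved dependencies (e.g. the goal SCC) start first.
        running = {pool.submit(solve, u): u for u, n in pending.items() if n == 0}
        while running:  # only the coordinating thread runs this loop
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                u = running.pop(future)
                future.result()  # re-raise any exception from the worker
                finished.append(u)
                for v in dependents[u]:
                    pending[v] -= 1
                    if pending[v] == 0:  # v is now ready to be solved
                        running[pool.submit(solve, v)] = v
    return finished


# Two independent chains of two SCCs each, both ending in the goal SCC.
toy_dependents = {
    "goal": ["a2", "b2"],
    "a2": ["a1"],
    "b2": ["b1"],
    "a1": [],
    "b1": [],
}
solved = []
order = solve_in_dependency_order(toy_dependents, solved.append, num_threads=2)
```

Once the goal SCC is solved, the two chains proceed independently of each other, which is the source of parallelism exploited by pcTVI.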


5 Empirical Evaluation 


In this section, we evaluate empirically the performance of pcTVI, comparing it to the
following three algorithms: (1) VI, the standard dynamic programming algorithm
(here we use its asynchronous round-robin variant), (2) LRTDP, a well-known
heuristic search algorithm, and (3) TVI, the Topological Value Iteration algorithm
described in Section 4. In the case of LRTDP, we used the admissible and
domain-independent min heuristic, first described in the original paper introducing
LRTDP [5]:


\[
h_{\min}(s) =
\begin{cases}
0, & \text{if } s \in G, \\
\min_{a \in A_s} \Big[ C(s, a) + \min_{s' \in \mathrm{succ}_a(s)} V(s') \Big], & \text{otherwise,}
\end{cases}
\]


where A_s denotes the set of applicable actions in state s and succ_a(s) is the set of
successors when applying action a at state s. The four competing algorithms (VI,
TVI, LRTDP and pcTVI) were implemented in C++ by the authors of this paper and
compiled using the GNU g++ compiler (version 11.2). All tests were performed on a
computer equipped with four Intel Xeon E5-2620V4 processors (each of them having
8 cores at 2.1 GHz, for a total of 32 cores). For every test domain, we measured
the running time of the four compared algorithms carried out until convergence to
an ε-optimal value function (we used ε = 10⁻⁹). Every domain was tested 15 times
with randomly generated MDP instances. To minimize random factors, we report 
the median values obtained over these 15 MDP instances. 

Since there is no standard MDP domain in the scientific literature suitable to
benchmark a parallel MDP solver, we propose a new general parametric MDP
domain that we use to evaluate the algorithms. This domain, which we call
chained-MDP, uses 5 parameters: (1) k, the number of independent chains
(c_1, c_2, ..., c_k) in the MDP; (2) n_scc, the number of SCCs
(scc_{i,1}, scc_{i,2}, ..., scc_{i,n_scc}) in every chain c_i; (3) n_sps, the number of
states per SCC; (4) n_a, the number of applicable actions per state; and (5) n_e, the
number of probabilistic effects per action. The possible successors succ(s) of a state s
in scc_{i,j} are the states in scc_{i,j} and either the states in scc_{i,j+1} if it exists,
or the goal state otherwise. When generating the transition function of a state-action
pair (s, a), we sampled n_e states uniformly from succ(s) with random probabilities.
In each of our tests, we used n_scc = 2, n_a = 5 and n_e = 5.
A representation of a Chained-MDP instance is shown in Figure 1.
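A generator for this kind of domain can be sketched as follows. It follows the description above, but several details are assumptions made for illustration: the state naming, the unit action costs, the sampling without replacement and the normalization of the random weights. The sketch also does not check that every generated block of states is actually strongly connected.

```python
import random


def state_name(chain, scc, index):
    """Hypothetical naming scheme for the states of a chained-MDP."""
    return f"c{chain}_scc{scc}_s{index}"


def generate_chained_mdp(k, n_scc, n_sps, n_a, n_e, seed=0):
    """Return mdp[state][action] = (cost, [(probability, successor), ...])."""
    rng = random.Random(seed)
    goal = "goal"
    mdp = {goal: {}}
    for i in range(k):
        for j in range(n_scc):
            own = [state_name(i, j, m) for m in range(n_sps)]
            if j + 1 < n_scc:
                following = [state_name(i, j + 1, m) for m in range(n_sps)]
            else:
                following = [goal]  # the last SCC of a chain leads to the goal
            candidates = own + following  # succ(s) for every s in this SCC
            for s in own:
                mdp[s] = {}
                for a in range(n_a):
                    targets = rng.sample(candidates, min(n_e, len(candidates)))
                    # Strictly positive random weights, normalized to sum to 1.
                    weights = [1.0 - rng.random() for _ in targets]
                    total = sum(weights)
                    outcomes = [(w / total, t) for w, t in zip(weights, targets)]
                    mdp[s][a] = (1.0, outcomes)  # unit cost (assumption)
    return mdp


instance = generate_chained_mdp(k=3, n_scc=2, n_sps=4, n_a=5, n_e=5)
```

Because the successors of a state never belong to another chain, the chains of such an instance can be solved independently once the goal state is fixed.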


Fig. 1 A chained-MDP instance where n_e = 3 and n_scc = 4. Each ellipse represents a strongly
connected component. 


Figure 2 presents the obtained results for the Chained-MDP domain when varying 
the number of states and fixing the number of chains (32). We can observe that when 
the number of states is small, pcTVI does not provide an important advantage over
the existing algorithms since the overhead of creating and managing the threads is 
taking most of the possible gains. However, as the number of states increases, the gap 
in the running time between pcTVI and the three other algorithms increases. This 
indicates that pcTVI is particularly useful on very large MDPs, which are usually 
needed when considering real-world domains. 

Figure 3 presents the results obtained for the same Chained-MDP domain when
varying the number of chains and fixing the number of states (1M). When the
number of chains increases, the total number of SCCs implicitly increases (which
also implies that the number of states per SCC decreases). This explains why each tested
algorithm becomes faster (TVI becomes faster by design, since it solves SCCs
one by one without doing useless state backups, and VI and LRTDP become faster
due to an increased locality of the considered states in memory, which improves
cache performance). The performance of pcTVI improves as the number of chains
increases (for the same reason as the other algorithms, but also due to increased
parallelization opportunities). We can also observe that for domains with only 4 chains,
pcTVI still clearly outperforms the other methods. This means that pcTVI does
not need a highly parallel server CPU and can be used on a standard 4-core computer.

Fig. 2 Average running times (in s) for the Chained-MDP domain with varying number of states
and fixed number of chains (32).

Fig. 3 Average running times (in s) for the Chained-MDP domain with varying number of chains
and fixed number of states (1M).


6 Conclusion


The main contributions of this paper are two-fold. First, we presented a new algo- 
rithm, pcTVI, which is, to the best of our knowledge, the first MDP solver that takes 
into account both the topological structure of the MDP (as in TVI) and the parallel 
capacities of modern computers (as in P3VI). Second, we introduced a new para- 
metric planning domain, Chained-MDP, which models any situation where different 
strategies (corresponding to a chain) can reach a goal, but where, once committed 
to a strategy, it is not possible to switch to a different one. This domain is ideal to 
evaluate the parallel performance of an MDP solver. Our experiments indicate that 
pcTVI outperforms the other competing methods (VI, LRTDP, and TVI) on every 
tested instance of the Chained-MDP domain. Moreover, pcTVI is particularly effec- 
tive when the considered MDP has many SCC chains (for increased parallelization 
opportunities) of large size (for decreased overhead of assigning small tasks to the 
threads). As future work, we plan to investigate ways of pruning provably subopti- 
mal actions, which would allow more SCCs to be found. While this paper focuses 
on the automated planning side of MDPs, the proposed optimization and parallel 
computing approaches could also be applied when using MDPs with Reinforcement 
Learning and other ML algorithms. 

Three-way Spectral Clustering


Cinzia Di Nuzzo and Salvatore Ingrassia 


Abstract In this paper, we present a spectral clustering approach for clustering
three-way data. Three-way data are characterized by three modes: n units,
p variables, and t different occasions. In other words, three-way data contain a t × p
observed matrix for each statistical observation. The units generated by simultaneous
observation of variables in different contexts are usually structured as three-way data,
so each unit is basically represented as a matrix. In order to cluster the n units in K
groups, the application of spectral clustering to three-way data can be a powerful tool
for unsupervised classification. Here, one example on real three-way data is
presented, showing that the spectral clustering method is competitive for clustering
this type of data.


Keywords: spectral clustering, kernel function, three-way data 


1 Introduction 


Spectral clustering methods are based on graph theory, where the units are
represented by the vertices of an undirected graph and the edges are weighted by
the pairwise similarities coming from a suitable kernel function, so the clustering
problem is reformulated as a graph partition problem, see e.g. [16, 6]. The spectral
clustering algorithm is a very powerful method for finding non-convex clusters of
data. Moreover, it is a handy approach for handling high-dimensional data, since it
works on a transformation of the raw data having a smaller dimension than the space
of the original data.


Cinzia Di Nuzzo (✉)
Department of Statistics, University of Roma La Sapienza, Piazzale Aldo Moro, 5, 00185 Roma,
Italy, e-mail: cinzia.dinuzzo@uniroma1.it


Salvatore Ingrassia
Department of Economics and Business, University of Catania, Piazza Università, 2, 95131 Catania,
Italy, e-mail: s.ingrassia@unict.it


© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_13




Three-way data derive from the observation of various attributes measured on a
set of units in different situations; some examples are longitudinal data on multiple
response variables and multivariate spatial data. Three-way data can also derive
from temporal measurements of a feature vector, so that the dataset is composed
of three modes: n units (matrices), p variables (columns), and t times (rows).
Clustering of three-way data has attracted growing interest in the literature, see e.g. [14],
[1]. Model-based clustering of three-way data was introduced by [15] in the
framework of matrix-variate normal mixtures. Recent papers include [9], which deals with
parsimonious models for matrix data; [11], which introduces two matrix-variate
distributions, both elliptical heavy-tailed generalizations of the matrix-variate
normal distribution; [12], which deals with three-way data clustering using matrix-variate
cluster-weighted models (MV-CWM); and [13], which considers an application to
educational data via mixtures of parsimonious matrix-normal distributions.

In this paper, we present a spectral clustering approach for clustering three-way
data and introduce a suitable kernel function between matrices. Indeed, since
the data matrices represent the vertices of the graph, each edge between two matrices
must be weighted by a single value.

The rest of the paper is organized as follows: in Section 2 the spectral clustering
method is summarized; in Section 3 a method to select the parameters in the spectral
clustering algorithm is described; in Section 4 three-way spectral clustering with
a new kernel function is introduced; in Section 5 an application based on real
three-way data is presented. Finally, in Section 6 we provide concluding remarks.


2 Spectral Clustering 


The spectral clustering algorithm for two-way data has been described in [8, 16, 6]. Here,
we summarize the main steps of this algorithm.

Let $V = \{x_1, x_2, \ldots, x_n\}$ be a set of points in $\mathcal{X} \subseteq \mathbb{R}^p$. In order to group the data
$V$ in $K$ clusters, the first step concerns the definition of a symmetric and continuous
function $\kappa : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ called the kernel function. Afterwards, a similarity
matrix $W = (w_{ij})$ can be assigned by setting $w_{ij} = \kappa(x_i, x_j) \geq 0$ for $x_i, x_j \in \mathcal{X}$,
and finally the normalized graph Laplacian matrix $L_{sym} \in \mathbb{R}^{n \times n}$ is introduced:


$$L_{sym} = I - D^{-1/2} W D^{-1/2}, \qquad (1)$$


where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is the degree matrix, $d_i$ is the degree of the
vertex $x_i$ defined as $d_i = \sum_{j \neq i} w_{ij}$, and $I$ denotes the $n \times n$ identity matrix. The
Laplacian matrix $L_{sym}$ is positive semi-definite with $n$ non-negative eigenvalues. For
a fixed $K \ll n$, let $\{\gamma_1, \ldots, \gamma_K\}$ be the eigenvectors corresponding to the smallest $K$
eigenvalues of $L_{sym}$. Then, the normalized Laplacian embedding in the $K$ principal
subspace is defined as the map $\Phi_\Gamma : \{x_1, \ldots, x_n\} \to \mathbb{R}^K$ given by


$$\Phi_\Gamma(x_i) = (\gamma_{1i}, \ldots, \gamma_{Ki}), \quad i = 1, \ldots, n,$$

where $\gamma_{1i}, \ldots, \gamma_{Ki}$ are the $i$-th components of $\gamma_1, \ldots, \gamma_K$, respectively. In other
words, the function $\Phi_\Gamma(\cdot)$ maps the data from the input space $\mathcal{X}$ to a feature space
defined by the $K$ principal subspace of $L_{sym}$. Afterwards, let $Y = (y_1, \ldots, y_n)'$ be the
$n \times K$ matrix given by the embedded data in the feature space, where $y_i = \Phi_\Gamma(x_i)$ for
$i = 1, \ldots, n$. Finally, the embedded data $Y$ are clustered according to some clustering
procedure; the k-means algorithm is the usual choice in the literature. However,
Gaussian mixtures have also been proposed for this purpose because they yield elliptical
cluster shapes, i.e. more flexible cluster shapes than those of k-means, see [2].
We also point out that the performance of other mixture models based on
non-Gaussian component densities has been analyzed, but Gaussian mixture models
can be considered a good trade-off between model simplicity and effectiveness,
see [3] for details.
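
The embedding step summarized above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: the function name `laplacian_embedding` is hypothetical, the diagonal of $W$ is set to zero so that the degrees match $d_i = \sum_{j \neq i} w_{ij}$, and every unit is assumed to have a strictly positive degree. The final clustering of the rows of $Y$ (by k-means or Gaussian mixtures) is left out.

```python
import numpy as np


def laplacian_embedding(W, K):
    """Return the eigenvalues of L_sym and the n x K embedded data Y.

    Hypothetical helper: L_sym = I - D^{-1/2} W D^{-1/2}, as in Eq. (1).
    """
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)              # assumption: no self-similarity in the degrees
    d = W.sum(axis=1)                     # degree d_i = sum_{j != i} w_ij (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    n = W.shape[0]
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    L_sym = (L_sym + L_sym.T) / 2.0       # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    Y = eigvecs[:, :K]                    # row i holds the i-th components of the
    return eigvals, Y                     # K eigenvectors with smallest eigenvalues
```

For a similarity matrix made of two disconnected blocks, the two smallest eigenvalues are zero and all units of the same block are mapped to the same point of the feature space, which is the ideal situation for the subsequent clustering step.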


3 A Graphical Approach for Parameter Selection 


According to the spectral clustering algorithm introduced in Section 2, the spectral
approach requires setting: i) the number of clusters $K$, and ii) the kernel function $\kappa$ (with
the corresponding parameter). In order to select these quantities, in the following we
summarize the method proposed in [4].

To begin with, we point out that the choice of the kernel function affects the entire 
data structure in the graph, and consequently, the structure of the Laplacian matrix 
and its eigenvectors. An optimal kernel function should lead to a similarity matrix 
W having (as much as possible) diagonal blocks: in this case, we get well-separated 
groups and we are also able to understand the number of groups in that data set 
by counting the number of blocks. For the sake of simplicity, we consider here the 
self-tuning kernel introduced by [17] 


$$\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\epsilon_i \epsilon_j}\right) \qquad (2)$$

with $\epsilon_i = \|x_i - x_h\|$, where $x_h$ is the $h$-th neighbor of point $x_i$ (similarly for $\epsilon_j$).


This function yields a similarity matrix that does not depend on any global scale parameter,
so that the spectral clustering algorithm is based on the pairwise proximity
between units. However, we still need to select the $h$-th neighbor of the unit in (2).
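
As an illustration of the self-tuning kernel in (2), the sketch below computes the similarity matrix for a small two-way data set with NumPy. The function name `self_tuning_kernel` is hypothetical, and the sketch assumes there are no duplicated points (otherwise a local scale could be zero).

```python
import numpy as np


def self_tuning_kernel(X, h):
    """Similarity matrix W with w_ij = exp(-||x_i - x_j||^2 / (eps_i * eps_j)).

    X is an n x p array; eps_i is the distance from x_i to its h-th nearest
    neighbor (hypothetical helper, assumes no duplicated points).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # pairwise Euclidean distances
    # After sorting each row, column 0 is the distance of a point to itself (0),
    # so column h is the distance to the h-th nearest neighbor.
    eps = np.sort(dist, axis=1)[:, h]
    return np.exp(-dist ** 2 / (eps[:, None] * eps[None, :]))


# Three points on a line: local scales are eps = (1, 1, 2) for h = 1.
W = self_tuning_kernel([[0.0], [1.0], [3.0]], h=1)
```

The local scales adapt to the density around each unit, so the only quantity left to choose is the neighbor index $h$.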

The main novelty of the joint-graphical approach concerns the analysis of some
graphic features of the Laplacian matrix, including the shape of the embedded space.
Indeed, the embedded data provide useful information for the clustering; in particular,
the main results in [10] and [5] imply that if the embedded data exhibit a
cone structure, then the number of clusters is equal to the number of cones/spikes
in the feature space; furthermore, a clearer clustering structure emerges when the
spikes are narrower and well separated.

The idea behind the graphical approach is to select the number $K$ of groups and the
parameter $h$ in the kernel function from a joint analysis of three main characteristics:
the plot of the Laplacian matrix; the maximum values of the eigengaps between two
consecutive eigenvalues; and the scatter plot of the mapped data in the feature space,
in particular the number of spikes counted in the embedded data space.

We remark that we cannot analyze all possible values of $h \in \{1, 2, \ldots, n-1\}$, and
hence we choose a suitable subset $H \subset \{1, 2, \ldots, n-1\}$; in particular, we choose
$H = \{1\%, 2\%, 5\%, 10\%, 15\%, 20\%\} \times n$ and select $h \in H$, see
the following procedure for details.


Parameter selection (K and h) 
Input: data set V, kernel function x, H. 


1. For each h in H, compute the matrix M; and analyze the block structure in the 
greyscale plot of M;. 

2. For each h in H, plot the embedded data in the feature space and analyze the 
shape of the cone structure. 

3. If the number of blocks in Step 1 is equal to the number of spikes in Step 2, then 
set K equal to the number of blocks. Go to Step 5. 

4. Otherwise, analyze the eigengap plot. 


a. If this plot shows a unique maximum eigengap for each $h \in H$, then set K
according to this maximum. Go to Step 5.

b. If this plot shows multiple maxima for different $h \in H$, select the number
of clusters K not to be smaller than the number of tight spikes in the
corresponding plot of the embedded data.


5. Select h € H such that the clearest orthogonal data structure emerges from the 
plot of the embedded data. 
6. Stop. 


Output: K, h. 
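
The procedure above is mainly graphical, but the eigengap analysis of Step 4 can be sketched numerically. The following NumPy example is only an illustration under stated assumptions: the function name `eigengap_suggestion` and the `max_k` cut-off are hypothetical, the diagonal of $W$ is zeroed before computing the degrees, and the suggested $K$ is simply the position of the largest gap between consecutive eigenvalues of $L_{sym}$. In practice this would be repeated for each $h \in H$ and combined with the visual inspection of Steps 1 and 2.

```python
import numpy as np


def eigengap_suggestion(W, max_k=10):
    """Suggest K as the position of the largest eigengap of L_sym.

    Hypothetical helper; returns (K_hat, gaps) where gaps[k-1] is the
    difference between the (k+1)-th and the k-th smallest eigenvalue.
    """
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)                 # assumption: degrees exclude w_ii
    d = W.sum(axis=1)                        # assumed strictly positive
    s = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - s[:, None] * W * s[None, :]
    eigvals = np.linalg.eigvalsh((L_sym + L_sym.T) / 2.0)[: max_k + 1]
    gaps = np.diff(eigvals)
    K_hat = int(np.argmax(gaps)) + 1
    return K_hat, gaps
```

When the similarity matrix is close to block diagonal, the first $K$ eigenvalues are close to zero and the gap after the $K$-th one dominates, which is why the eigengap plot and the block structure tend to agree.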


4 Three-way Spectral Clustering 


In this section, we propose a spectral approach for clustering three-way data.
Three-way data consist of a data set referring to the same sets of units and variables,
observed in different situations, i.e., a set of multivariate matrices that can be
organized in three modes: $n$ units, $p$ variables, and $t$ situations. Therefore, given
$n$ matrices that represent the vertices of the graph, each matrix is composed of $p$
columns that represent our variables and $t$ rows that represent the time or another
feature. So we have a tensor of dimension $n \times t \times p$; thus the dataset is a tensor $\{X_{isk}\}$
for $i = 1, \ldots, n$, $s = 1, \ldots, t$, $k = 1, \ldots, p$.

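We define a distance function $\delta_m$ between two matrices $A = (a_{sk})$, $B = (b_{sk}) \in \mathbb{R}^{t \times p}$ such that
$\delta_m : \mathbb{R}^{t \times p} \times \mathbb{R}^{t \times p} \to [0, +\infty)$ is defined as

$$\delta_m(A, B) := \|A - B\|_F = \sqrt{\sum_{s=1}^{t} \sum_{k=1}^{p} |a_{sk} - b_{sk}|^2}, \qquad (3)$$


where $\| \cdot \|_F$ is the Frobenius norm¹. Thus the distance between two units in the matrix
data $X$ is equal to


$$\delta_m(X_{i_1 sk}, X_{i_2 sk}) = \sqrt{\sum_{s=1}^{t} \sum_{k=1}^{p} |X_{i_1 sk} - X_{i_2 sk}|^2}, \quad \text{for } i_1, i_2 = 1, \ldots, n. \qquad (4)$$


For simplicity, in the following, we denote $\delta_m(X_{i_1 sk}, X_{i_2 sk})$ by $\delta_m(i_1, i_2)$. Moreover,
we define the three-way self-tuning kernel function as


$$\kappa_S : \mathcal{X} \times \mathcal{X} \to [0, +\infty), \quad \kappa_S(i_1, i_2) = \exp\left(-\frac{\delta_m(i_1, i_2)^2}{\epsilon_{i_1} \epsilon_{i_2}}\right), \qquad (5)$$


where $\epsilon_{i_1}$ and $\epsilon_{i_2}$ need to be selected as in the kernel defined in (2).

Afterwards, we compute the similarity matrix $W$ given by $w_{i_1 i_2} = \kappa_S(i_1, i_2)$, so that
we can apply the spectral clustering algorithm.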

Finally, we point out that, differently from approaches based on mixtures of 
matrix-variate data, the number of variables of the data set is not a critical issue 
because the spectral clustering algorithm is based on distance measures. 
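
A minimal NumPy sketch of the three-way self-tuning kernel is given below. It is not the authors' implementation: the function name `three_way_self_tuning_kernel` is hypothetical, the data are assumed to be stored as an array of shape (n, t, p), and the local scale of each unit is taken as its Frobenius distance to the $h$-th nearest matrix, by analogy with (2). The sketch uses the fact that the Frobenius distance between two t × p matrices equals the Euclidean distance between their flattened versions.

```python
import numpy as np


def three_way_self_tuning_kernel(X, h):
    """Similarity matrix between n matrices stored in an (n, t, p) array.

    Hypothetical helper: delta_m is the Frobenius distance between units and
    eps_i is the distance from unit i to its h-th nearest unit (assumed > 0).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    flat = X.reshape(n, -1)                      # each t x p matrix becomes one row
    diff = flat[:, None, :] - flat[None, :, :]
    delta = np.sqrt((diff ** 2).sum(axis=2))     # Frobenius distances delta_m(i1, i2)
    eps = np.sort(delta, axis=1)[:, h]           # column 0 is the zero self-distance
    return np.exp(-delta ** 2 / (eps[:, None] * eps[None, :]))
```

Since only the n × n matrix of pairwise distances enters the kernel, the cost of the subsequent spectral steps depends on the number of units and not on $t$ or $p$.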


5 A Real Data Application 


We apply the three-way spectral clustering to the analysis of the Insurance data set, 
available in the splm R package. This dataset was initially introduced by [7] and 
has recently been analyzed by [12]. The goal is to study the consumption of non-life 
insurance during the years 1998-2002 in the 103 Italian provinces, so t = 5 and 
n = 103. As regards the number of variables, we consider all the variables contained 
in the data set, so p = 11. Thus, we have 103 matrices of dimensions 5 x 11. 

The 103 Italian provinces are divided into north-west (24 provinces), north- 
east (22 provinces), center (21 provinces), south (23 provinces), and islands (13 
provinces). 

As regards the choice of K and h, we consider the graphical approach introduced
in Section 3. In Figure 1 the geometric features of spectral clustering are plotted
as h varies. From the number of blocks of the Laplacian matrix (Figure 1-a)), the
first maximum eigengap (Figure 1-b)) and the number of spikes in the feature space
(Figure 1-c)), we deduce that the number of clusters is K = 2. For the selection of
h, we can choose h = 15 or h = 21 indifferently, because in both cases the
eigengap attains its maximum at K = 2. In Table 1 the
clustering results are presented. This table shows that only 6 central provinces are
classified together with the southern provinces. To check whether these provinces
border the southern provinces, let us analyze the spectral clustering results on the
map of Italy. Figure 2-a) illustrates the partition deriving from spectral clustering
on the political map of Italy, where Italian regions are delimited by yellow lines
and provinces by black lines. The result shows a clear separation
between center-north Italy and south-insular Italy; in fact, the center-north has a
level of insurance penetration close to the European averages, while the South is
less developed economically. However, the Massa-Carrara province should belong
to the centre-north group. Moreover, we remark that the Rome province, which hosts the
capital of Italy, has a level of socio-economic development comparable to that of northern
Italy, which justifies its belonging to the centre-north group.


Fig. 1 Insurance data. Spectral clustering features.

Table 1 Insurance data. Table of spectral clustering result.

Cluster 1    NORTH-WEST (24 provinces)
             NORTH-EAST (22 provinces)
             CENTRE (15 provinces)
Cluster 2    CENTRE (6 provinces)
             SOUTH (23 provinces)
             ISLANDS (13 provinces)


¹ In general, given a matrix $A \in \mathbb{R}^{n \times m}$ with $A = (a_{ij})$ for $i = 1, \ldots, n$ and $j = 1, \ldots, m$, the
Frobenius norm is defined by $\|A\|_F := \sqrt{\sum_{j=1}^{m} \sum_{i=1}^{n} |a_{ij}|^2}$.

Furthermore, in Figure 2-b) we also represent the partition produced by the
MN-CWM proposed in [12]. We note that the two clustering results are very similar to
each other and differ only for one province of central Italy (namely, the province
of Terni). It should also be emphasized that the dataset analyzed by [12] is different
from the one analyzed here, since, to avoid excessive parameterization of the models,
the authors select only p = 5 variables in the data set.



Fig. 2 Insurance data. a) Three-way spectral clustering; b) Method proposed by [12]. 

6 Conclusion


In this paper, a spectral approach to cluster three-way data has been proposed.
The data are organized in a tensor and the vertices in the graph are represented by
the matrices of dimension t × p. In order to weight the edges between matrices in the graph, a
kernel function based on the Frobenius norm of the difference between matrices has been
introduced. The performance of the spectral clustering algorithm has been shown on
one real three-way data set, where our method is competitive with other clustering
methods proposed in the literature for matrix-data clustering. Finally, as a
suggestion for future research, other kernel functions can be introduced
by considering distances other than the one induced by the Frobenius norm.

Improving Classification of Documents by
Semi-supervised Clustering in a Semantic Space 


Jasminka Dobša and Henk A. L. Kiers


Abstract In this paper we propose a method for representation of documents in a
semantic lower-dimensional space based on a modified Reduced k-means method
which penalizes clusterings that are distant from the classification of training documents
given by experts. Reduced k-means (RKM) enables simultaneous clustering of
documents and extraction of factors. By projecting documents represented in the
vector space model onto the extracted factors, documents are clustered in the semantic
space in a semi-supervised way (using penalization), because clustering is guided by
the classification given by experts, which enables improvement of classification
performance on test documents.

Classification performance is tested for classification by logistic regression and
support vector machines (SVMs) for classes of the Reuters-21578 data set. It is shown that
representation of documents by the RKM method with penalization improves the
average precision of classification by SVMs for the 25 largest classes of the Reuters
collection by about 5.5% at the same level of average recall, in comparison to
the basic representation in the vector space model. In the case of classification by
logistic regression, representation by RKM with penalization improves average
recall by about 1% in comparison to the basic representation.

1 Introduction


There are two main families of methods that deal with representation of documents 
and words that index them: global matrix factorization methods such as Latent Se- 
mantic Analysis (LSA) [2] and local context window methods such as the continuous 
bag of words (CBOW) model and the continuous skip-gram model [8]. The latter
use neural networks to learn representations of words and have been intensively
explored in the scientific community, since the development of fast processors
has enabled the processing of huge amounts of data, which has resulted in improved
performance on a wide spectrum of text mining and natural language tasks. However,
representation of words solely by context window methods has a drawback due to
the neglect of information about global corpus statistics [9].

In this paper we propose a method for representation of documents by application 
of a penalized version of the RKM method [4] on a term-document matrix. The 
corpus of textual documents is represented by a sparse term-document matrix in 
which entry (i, j) is equal to the weight of the i-th index term for the j-th document. 
Weights of terms are given by the TfIdf weighting which utilizes local information 
about the frequency of the i-th term in the j-th document and global information about 
usage of the i-th term in the entire collection. A benchmark method that utilizes global 
matrix factorization on term-document matrices is LSA [2] which uses truncated 
singular value decomposition (SVD) for representation of terms and documents in 
lower-dimensional semantic space. SVD does not capture the clustering structure of 
data which motivates application of the RKM. 

The rest of the paper is organized as follows: the second section describes related 
work on representation of documents and words and methods of dimensionality 
reduction related to RKM. The third section describes the modified RKM method 
with penalization, while the fourth section describes an experiment on Reuters-21578 
data set. In the last section conclusions and directions for further work are given. 


2 Related Work 
2.1 Representation by Matrix Factorization Methods 


A benchmark method among methods that utilize matrix factorization for repre- 
sentation of textual documents is the method of LSA introduced in 1994 [2]. By 
LSA a sparse term-document matrix is transformed via SVD into a dense matrix 
of the same term-document type with representations of words (index terms) and 
documents in a lower-dimensional space. The idea is to map similar documents, or 
those that describe the same topics, closer to each other regardless of the terms that 
are used in them. A very efficient application of LSA is in cross-lingual information 
retrieval where relevant documents for a query in one language are retrieved from a 
set of documents in another language [7]. To our knowledge, the application
of methods that simultaneously cluster objects and extract factors in the field of text
mining is very limited. In [6] a method is proposed for cross-lingual information
retrieval based on the RKM method.


2.2 Neural Network Word Embeddings 


Another approach is to learn representations of words, or so called embeddings, by 
using local context windows. In 2003 Bengio and coauthors [1] proposed a neural 
probabilistic language model that uses simple neural network architecture to learn 
distributed representations for each word as well as probability functions for word 
sequences, expressed in terms of these representations. Mikolov and coauthors [8]
proposed in 2013 two models based on single-layer neural network architectures:
the skip-gram model, which predicts context words given the current word, and the
continuous bag of words model, which predicts the current word based on the context.
In 2014 the GloVe model [9] was proposed, based on the critique that neural network 
models suffer from the disadvantage that they do not utilize co-occurrence statistics 
of the entire corpus, but scan only context windows of words ignoring vast amounts 
of repetition in the data. That model exploits the advantages of global matrix factor- 
ization methods by utilization of term-term co-occurrence matrices and local context 
window methods. 

Word embeddings can be classified as static, such as word2vec [8] and GloVe
[9], and contextual, such as ELMo [10] and BERT [5]. Contextual representation
is introduced in [10] in order to model characteristics of word use (syntax and
semantics) on one side and variation in word representation due to the context in
which words appear on the other.


2.3 Methods for Simultaneous Clustering and Factor Extraction 


A standard procedure for clustering of objects in a lower-dimensional space is tandem 
analysis which includes projection of data by principal components and clustering 
of data in a lower-dimensional space. Such an approach was criticized in [3] and 
[4] since principal components may extract dimensions which do not necessarily 
significantly contribute to the identification of a clustering structure in the data. 
As a response, De Soete and Carroll proposed the method of RKM [4] which 
simultaneously clusters data and extracts the factors of variables by reconstructing 
the original data with only centroids of clusters in a lower-dimensional space. The 
algorithm of Factorial k-means (FKM) proposed by Vichi and Kiers [13] has the 
same aim of simultaneous reduction of objects and variables and it reconstructs the 
data in a lower-dimensional space by its centroids in the same space. The application 
of the latter method is limited in text mining since the method is limited to cases in 
which the number of variables is less than the number of cases. In [11] the RKM
and FKM methods are compared theoretically and by simulations in order to
identify cases for their application. Timmerman and associates also propose the method
of Subspace k-means [12], which gives an insight into cluster characteristics in terms
of the relative positions of clusters given by centroids and the shape of the clusters given
by within-cluster residuals.


3 Reduced k-Means with Penalization 


Let $X$ be an $m \times n$ term-document matrix. We use the following notation:


• $A$ is an $m \times k$ columnwise orthonormal matrix of extracted factors;

• $M$ is an $n \times c$ membership matrix, where $c$ is a predefined number of clusters;
$m_{ic} = 1$ if object (document) $i$ belongs to cluster $c$ and 0 otherwise;

• $Y$ is a $c \times k$ matrix which gives centroids of clusters in the lower-dimensional
space.


By definition, we suppose that every document in the collection belongs to exactly
one cluster. The RKM method minimizes the loss function


$$F(M, A) = \|X - A Y' M'\|^2 \qquad (1)$$


in the least squares sense. The dimension of the lower-dimensional space must be
less than or equal to the number of clusters. Modified RKM with penalization minimizes
the loss function


$$F(M, A) = \|X - A Y' M'\|^2 + \lambda \|M - G\|^2 \qquad (2)$$


where $G$ is an $n \times c$ membership matrix based on expert judgements. If $c$ is the number of
classes, then $g_{ic} = 1$ if object (document) $i$ belongs to class $c$, and 0 otherwise. With the
second summand in the loss function we penalize clusterings that are distant from
the classes given by expert judgements, using the parameter $\lambda$ to control the importance
of that penalization. We use the alternating least squares (ALS) algorithm analogous
to the one in [4], which alternates between updates of the loading matrix $A$ in
one step and of the membership matrix $M$ in another. As each step of the
ALS algorithm improves the loss function, the algorithm converges to at least a local
minimum. By starting the procedure from a large number of random initial estimates
and choosing the best solution, the chances of obtaining the global minimum are
increased.
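
To make the penalized loss (2) concrete, the following NumPy sketch evaluates it and shows one way the membership step could look. This is an illustration derived directly from (2) with $A$ and $Y$ held fixed, not necessarily the authors' Matlab implementation: the function names `penalized_rkm_loss` and `update_membership` are hypothetical. The update relies on the fact that both terms of (2) decompose over documents, so each document can be assigned independently to the cluster with the smallest sum of reconstruction error and penalty.

```python
import numpy as np


def penalized_rkm_loss(X, A, Y, M, G, lam):
    """Loss (2): ||X - A Y' M'||^2 + lam * ||M - G||^2 (squared Frobenius norms)."""
    residual = X - A @ Y.T @ M.T
    return float(np.sum(residual ** 2) + lam * np.sum((M - G) ** 2))


def update_membership(X, A, Y, G, lam):
    """Assign each document to the cluster minimizing its share of the loss.

    Hypothetical helper: X is m x n, A is m x k, Y is c x k, G is n x c.
    """
    n, c = G.shape
    centroids = A @ Y.T                                   # m x c, one column per cluster
    # fit[i, j] = squared distance between document i and centroid j
    fit = ((X[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
    # ||e_j - g_i||^2 = 1 - 2 g_ij + ||g_i||^2 for the j-th unit vector e_j
    penalty = 1.0 - 2.0 * G + (G ** 2).sum(axis=1, keepdims=True)
    best = np.argmin(fit + lam * penalty, axis=1)
    M = np.zeros((n, c))
    M[np.arange(n), best] = 1.0
    return M
```

With $\lambda = 0$ the update reduces to the usual nearest-centroid assignment of RKM, while larger values of $\lambda$ make it increasingly costly to place a training document in a cluster other than its expert class.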

4 Experiment
4.1 Design of Experiment 


Experiments are conducted for classification on the Reuters-21578 data set, specifi- 
cally using the ModApte Split which assigns Reuters reports from April 7, 1987 and 
before to the training set, and after, until end of 1987, to the test set. It consists of 
9603 training and 3299 test documents. The collection has 90 classes which contain 
at least one training and test document. Documents are represented by a bag of words 
representation. A list of index terms is formed based on terms that appear in at least 
four documents of the collection, which resulted in a list of 9867 index terms. 

Classification is conducted by logistic regression (LR) and the SVM algorithm. The
basic model is the bag of words representation (full representation), while
representations in the lower-dimensional space are obtained by SVD (Latent Semantic
Analysis), RKM and RKM with penalization ($\lambda = 0.1, 0.2, 0.4, 0.6$). For RKM and
RKM with penalization, representations are obtained by applying matrix
factorization to the term-document matrix of the training documents, and by projecting
test documents onto the factors given by matrix $A$ in the factorization. RKM is computed
for 90 clusters (which corresponds to the number of classes in the collection) using
$k = 85$ as the dimension of the lower-dimensional space, and truncated SVD is
computed for $k = 85$ as well. The RKM and RKM with penalization algorithms are run
10 times (with different starting estimates), and the representation and factorization
with the minimal loss function value is chosen. The optimal cost parameter for LR and
SVM is chosen by grid search from the set of values 0.1, 0.5, 1, 10, 100
and 1000. For the classification methods, the LiblineaR library in R is used, while
the RKM and RKM with penalization algorithms are implemented in Matlab.


4.2 Results 


Results are given in terms of precision, recall, and $F_1$ measure of the classification.
Recall is the proportion of correctly classified samples among all positive samples (i.e.,
samples actually belonging to the class, according to the expert), while precision is
the proportion of correctly classified samples among all samples classified as positive
by the model. Figures 1 and 2 show the average $F_1$ measures of
classification for groups of 5 classes sorted in descending order by their size, i.e. number of
training documents (which is 2877 to 389 for classes 1-5, 369 to 181 for classes 6-10,
140 to 111 for classes 11-15, 101 to 75 for classes 16-20, 75 to 55 for classes 21-25,
50 to 41 for classes 26-30, 40 to 37 for classes 31-35, 35 to 24 for classes 36-40,
23 to 19 for classes 41-45, 18 to 16 for classes 46-50, 16 to 13 for classes 51-55,
and 13 to 10 for classes 56-60). Figure 1 shows the results for classification by LR,
and Figure 2 those for classification by SVM. Only the 60 largest classes are considered,
since smaller classes (fewer than 10 training documents) are not of interest for this
research: for those classes recall is low, and it can be expected that the full bag
of words representation will result in better recognition, since such classes can possibly be
recognized by key words but not by transformed representations. It can be seen that
$F_1$ measures are comparable for the full representation and the various representations
by RKM with penalization for both classification algorithms for the 25 largest
classes. For smaller classes, results for representation by RKM with penalization are
unstable, although for some classes they were better than the basic representation
(in the case of LR). Classification for representations obtained by SVD and RKM
without penalization resulted in lower $F_1$ measures for all class sizes.

Fig. 1 Average $F_1$ measure of classification by LR for groups of 5 classes sorted by their size.
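
The evaluation measures defined above can be written down in a few lines of plain Python for a single class. The sketch is only illustrative: the function name `precision_recall_f1` is hypothetical, labels are assumed to be encoded as 1 (document belongs to the class) and 0 (it does not), and $F_1$ is taken in its standard form as the harmonic mean of precision and recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for one binary (one-vs-rest) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # correctly classified positives
    fp = sum(1 for t, p in pairs if not t and p)      # wrongly classified as positive
    fn = sum(1 for t, p in pairs if t and not p)      # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Three documents belong to the class; the model flags three, two of them correctly.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Averages over a group of classes, as reported in the figures and in Table 1, are then obtained by computing these quantities per class and taking their mean.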

Table 1 shows the average precision, recall and $F_1$ measures for the 25 largest
classes for both classification algorithms and all observed representations. In the case
of classification by LR, the average recall is improved for representation by RKM
with penalization (for $\lambda = 0.4$) by approximately 1% compared to the basic full
representation. For classification by SVM, average precision is improved for representation
by RKM with penalization (for $\lambda = 0.6$) by almost 6%, and the $F_1$ measure is improved
for representation by RKM with penalization ($\lambda = 0.2$) by 2% in comparison to
the basic full representation. The best results are obtained for classification by the
SVM algorithm and representation by RKM with penalization with $\lambda = 0.2$, for
which precision is improved by 5% with a similar level of recall as in the basic
representation.
Fig. 2 Average F1 measure of classification by SVM for 5 classes sorted by their size. 

Table 1 Average precision, recall, and F1 measure of classification for the 25 largest classes. 

Class. algorithm      Logistic regression            SVM 
Representation        Precision  Recall  F1          Precision  Recall  F1 
Full                  86.31      70.24   76.84       82.76      71.72   76.47 
SVD                   82.80      64.84   71.42       85.24      61.61   68.99 
RKM                   80.80      61.10   68.44       82.93      55.66   63.83 
RKMPenal, λ = 0.1     84.24      70.71   76.27       87.24      71.01   77.62 
RKMPenal, λ = 0.2     84.68      71.23   76.72       87.78      72.16   78.57 
RKMPenal, λ = 0.4     84.72      71.38   76.88       87.86      64.93   73.87 
RKMPenal, λ = 0.6     85.89      70.40   76.80       88.40      66.11   74.75 


5 Conclusions and Further Work 


In this paper we propose a modification of the RKM method that simultaneously 
clusters documents and extracts factors on one side, and penalizes clusterings that are 
distant from the classification of the training documents given by experts on the other 
side. We show that such a modification enables representation of textual documents 
in a semantic lower-dimensional space that improves performance of classification. 
The method is tested on classes of the Reuters-21578 data set and compared to the 
full bag of words representation and the LSA method. It is also shown that the 
original RKM method without the proposed modification does not have the same effect 
on classification performance; its effect is similar to that of the LSA method. 

The proposed representation method can improve precision and recall of classi- 
fication for sufficiently large classes, i.e. those that have enough training documents 
to enable capturing of semantic relations and characteristics of classes. A more 
important effect can be observed in the improvement of precision. 

In the future we plan to investigate hybrid models using representation of words by 
neural language models and application in different domains, such as classification 
of images. 
Trends in Data Stream Mining 


João Gama 


Abstract Learning from data streams is a hot topic in machine learning and data 
mining. This article presents our recent work on the topic of learning from data 
streams. We focus on emerging topics, including fraud detection and hyper-parameter 
tuning for streaming data. The first study is a case study on interconnect bypass 
fraud. This is a real-world problem from high-speed telecommunications data that 
clearly illustrates the need for online data stream processing. In the second study, 
we present an optimization algorithm for online hyper-parameter tuning from non- 
stationary data streams. 


Keywords: fraud detection, hyperparameter tuning, learning from data streams 


1 Introduction 


Developments in information and communication technologies have dramatically 
changed data collection and processing methods. What distinguishes current data 
sets from earlier ones are automatic data feeds. We do not just have people entering 
information into a computer; we have computers entering data into each other. In 
the most challenging applications, data are best modeled not as persistent tables, but 
rather as transient data streams. 

This article presents our recent work on the topic of learning from data streams. 
It is organized into two main sections. The first is a real-world application of data 
stream techniques to a telecommunications fraud detection problem, based on 
the work presented in [5]. The second discusses the problem of hyperparameter 
tuning in the context of data stream mining, based on the work presented in [4]. 
2 Fraud Detection: a Case Study 


The high asymmetry between international and domestic termination rates, where 
international calls incur higher charges applied by the operator in whose network the 
call terminates, is fertile ground for fraud in telecommunications. Several types of 
fraud exploit this differential, Interconnect Bypass Fraud being one of the most 
significant [1, 3]. 

In this type of fraud, one of the several intermediaries responsible for delivering the 
calls forwards the traffic over a low-cost IP connection and reintroduces the call in the 
destination network as a local call, using VoIP gateways. The entity that sent the 
traffic is thus charged the amount corresponding to the delivery of international 
traffic. However, since the call is illegally delivered as national traffic, the fraudster 
does not have to pay the international termination fee and appropriates this amount. 

Traditionally, telecom operators analyze the calls of these gateways to detect 
fraud patterns and, once the gateways are identified, have their SIM cards blocked. 
The constant evolution of the technology adopted in these gateways allows them to work 
like real SIM farms, capable of manipulating identifiers, simulating call patterns 
similar to those of regular users, and even being mounted on vehicles to 
complicate detection based on location information. 

Interconnect bypass fraud detection algorithms typically consume a stream 
S of events, where S contains information about the origin number (A-Number), 
the destination number (B-Number), the associated timestamp, and the status of the 
call (accomplished or not). The expected output of this type of algorithm is a set of 
potentially fraudulent A-Numbers that require validation by the telecom operator. 
This process is not fully automated, to avoid blocking legitimate A-Numbers and 
incurring penalties. In interconnect bypass fraud, we can observe three different types 
of abnormal behavior: 


1. bursts of calls, in which an A-Number produces an enormous number of calls 
(#calls, above that of all other A-Numbers) during a specific time window 
W. The size of this time window is typically small; 

2. repetitions, in which some pattern (#calls) produced by an A-Number is 
repeated during consecutive time windows W; 

3. mirror behaviors, in which two distinct A-Numbers (typically from the same 
country) produce the same pattern of calls (#calls) during a time window W. 
Algorithm 2 The Lossy Counting Algorithm. 

procedure LossyCounting(S: a sequence of examples; ε: error margin) 
    n ← 0; Δ ← 0; T ← ∅ 
    for example e ∈ S do 
        n ← n + 1 
        if e is monitored then 
            Increment Count_e 
        else 
            T ← T ∪ {e, 1 + Δ} 
        end if 
        if ⌊n/ε⌋ ≠ Δ then 
            Δ ← n/ε 
        end if 
        for all j ∈ T do 
            if Count_j < δ then 
                T ← T \ {j} 
            end if 
        end for 
    end for 
end procedure 


Algorithm 3 The Lossy Counting with Fast Forgetting Algorithm. 

procedure LossyCounting(S: a sequence of examples; ε: error margin; α: fast forgetting parameter) 
    n ← 0; Δ ← 0; T ← ∅ 
    for example e ∈ S do 
        n ← n + 1 
        if e is monitored then 
            Increment Count_e 
        else 
            T ← T ∪ {e, 1 + Δ} 
        end if 
        if ⌊n/ε⌋ ≠ Δ then 
            Δ ← n/ε 
        end if 
        for all j ∈ T do 
            Count_j ← α * Count_j 
            if Count_j < δ then 
                T ← T \ {j} 
            end if 
        end for 
    end for 
end procedure 
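The counting scheme can be illustrated with a minimal Python sketch. It is not the implementation used in the case study: it assumes the usual bucket formulation of Lossy Counting (bucket width ⌈1/ε⌉), applies the forgetting factor once per bucket boundary rather than at every event, and uses the current bucket index as the pruning threshold, since the threshold δ of the pseudocode is not defined in the text. The function name and the toy stream are illustrative.

```python
import math


def lossy_counting(stream, epsilon, alpha=1.0):
    """Approximate counts of the items in `stream`.

    epsilon: error margin; the bucket width is ceil(1 / epsilon) (assumption).
    alpha:   forgetting factor in (0, 1]; alpha = 1.0 disables forgetting.
    """
    width = math.ceil(1.0 / epsilon)
    counts = {}   # monitored items -> (possibly decayed) count
    delta = 0     # index of the last completed bucket
    n = 0         # number of examples seen so far
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1.0
        else:
            counts[item] = 1.0 + delta   # new items start at 1 + delta
        if n // width != delta:          # a bucket boundary was crossed
            delta = n // width
            for key in list(counts):
                counts[key] *= alpha     # fast forgetting step
                if counts[key] < delta:  # drop infrequent or stale items
                    del counts[key]
    return counts


# Toy stream: number "a" is active first, then number "b" takes over.
calls = ["a"] * 30 + ["b"] * 30
plain = lossy_counting(calls, epsilon=0.1)
forgetful = lossy_counting(calls, epsilon=0.1, alpha=0.5)
```

Without forgetting both numbers remain monitored; with alpha = 0.5 the count of the number that stopped calling decays until it is dropped, which is the behavior one wants when tracking the currently most active numbers.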


Figures 1 and 2 present the evolution of the top-10 most active phone numbers. 
Figure 1 presents the top-10 cumulative counts, while Figure 2 presents the 
top-10 counts with forgetting. 


3 Learning to Learn Hyperparameters 


A hyperparameter is a parameter whose value is used to control the learning process. 
Hyperparameter optimization (or tuning) is the problem of choosing a set of optimal 
hyper-parameters for a learning algorithm. For this purpose we adapt the Nelder-Mead 
algorithm [4] to the streaming context. This algorithm is a simplex search 
algorithm for multidimensional unconstrained optimization without derivatives. The 
vertexes of the simplex, which define a convex hull shape, are iteratively updated 
in order to sequentially discard the vertex associated with the largest cost function 
value. 

The Nelder-Mead algorithm relies on four simple operations: reflection, shrinkage, 
contraction and expansion. Figure 3 illustrates the four corresponding Nelder-Mead 
operators R, S, C and E. Each vertex represents a model containing a set of 
hyper-parameters. The vertexes (models under optimisation) are ordered and named 
according to the root mean square error (RMSE) value: best (B), good (G), which is 
the closest to the best vertex, and worst (W). M is a mid vertex (auxiliary model). The 
bottom panel in Figure 3 describes the four operations: contraction, reflection, 
expansion, and shrinkage. 

Fig. 2 Approximate Counts with Lossy Counting and Fast Forgetting. 

For each Nelder-Mead operation, it is necessary to compute an additional set of 
vertexes (midpoint M, reflection R, expansion E, contraction C and shrinkage S) 
and verify if the calculated vertexes belong to the search space. First, the algorithm 
computes the midpoint (M) of the best face of the shape as well as the reflection 
point (R). After this initial step, it determines whether to reflect or expand based on 
a set of heuristics. 
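The candidate vertexes can be sketched as follows for a simplex with vertexes B, G and W. This is only an illustration under stated assumptions: it uses the textbook Nelder-Mead coefficients, computes both an inside and an outside contraction, and treats the search space as a box to which candidates are clipped. The function name, the dictionary keys and the bounds are hypothetical and not taken from the paper.

```python
import numpy as np


def nelder_mead_candidates(best, good, worst, lower, upper):
    """Return the auxiliary vertexes for one Nelder-Mead step.

    best, good, worst: hyper-parameter vectors ordered by increasing RMSE.
    lower, upper:      box bounds of the search space (assumption).
    """
    b, g, w = (np.asarray(v, dtype=float) for v in (best, good, worst))
    m = (b + g) / 2.0       # midpoint M of the best face
    r = 2.0 * m - w         # reflection R of W through M
    e = 2.0 * r - m         # expansion E beyond R
    c_in = (w + m) / 2.0    # contraction C between W and M
    c_out = (r + m) / 2.0   # contraction C between M and R
    s = (b + w) / 2.0       # shrinkage S of W towards B
    candidates = {"M": m, "R": r, "E": e, "C_in": c_in, "C_out": c_out, "S": s}
    # Keep every candidate inside the search space.
    return {name: np.clip(v, lower, upper) for name, v in candidates.items()}


points = nelder_mead_candidates(
    best=[0.0, 0.0], good=[2.0, 0.0], worst=[1.0, 2.0], lower=-3.0, upper=3.0
)
```

In this example the expansion point (1, -4) falls outside the box and is moved back to (1, -3), which is one simple way of verifying that candidates belong to the search space.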

The dynamic sample size, which is based on the RMSE metric, attempts to identify 
significant changes in the streamed data. Whenever such a change is detected, the 
Nelder-Mead algorithm compares the performance of the n + 1 models under analysis to 
choose the most promising model. The sample size S_size is given by Equation 1, where σ 
represents the standard deviation of the RMSE and M the desired error margin. We 
use M = 95%. 

$$S_{size} = \frac{4\sigma^2}{M^2} \qquad (1)$$

However, to avoid using small samples, which imply error estimates with large 
variance, we defined a lower bound of 30 samples. The adaptation of the Nelder-Mead 
algorithm to on-line scenarios relies extensively on parallel processing. The 
main thread launches the n + 1 model threads and starts a continuous event processing 
loop. This loop dispatches the incoming events to the model threads and, whenever 
it reaches the sample size interval, assesses the running models and calculates the 
new sample size. The model assessment involves the ordering of the n + 1 models 
by RMSE value and the application of the Nelder-Mead algorithm to substitute the 
worst model. The Nelder-Mead parallel implementation creates a dedicated thread 
per Nelder-Mead operator, totaling seven threads. Each Nelder-Mead operator thread 
generates a new model and calculates the incremental RMSE using the instances of 
the last sample size interval. The worst model is substituted by the Nelder-Mead 
operator thread model with the lowest RMSE. 

Figure 4 presents the critical difference diagram [2] of three hyper-parameter 
tuning approaches (SPT, grid search, and default parameter values) on four benchmark 
classification datasets. The diagram clearly illustrates the good performance of SPT. 

Fig. 3 SPT working modes: Exploration and Deployment. Bottom panel illustrates the Nelder & 
Mead operators. 
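As a small worked example of Equation 1 with its lower bound of 30 samples, the sketch below computes the sample size from the standard deviation of the RMSE. The function name and the rounding up to an integer are assumptions, and the margin is taken to be expressed on the same scale as the RMSE.

```python
import math


def sample_size(rmse_std, margin, lower_bound=30):
    """Sample size 4 * sigma^2 / M^2, never smaller than `lower_bound`."""
    size = math.ceil(4.0 * rmse_std ** 2 / margin ** 2)
    return max(lower_bound, size)


large_variance = sample_size(rmse_std=2.0, margin=0.5)   # 4 * 4 / 0.25 = 64
small_variance = sample_size(rmse_std=0.1, margin=1.0)   # bounded below by 30
```

A noisier RMSE therefore leads to longer evaluation intervals, while a stable RMSE falls back to the minimum of 30 examples.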


4 Conclusions 


This paper reviews our recent work in learning from data streams. The two works 
present different approaches to dealing with high-speed and time-evolving data: 
from applied research in fraud detection to fundamental research on hyperparameter 
optimization for streaming algorithms. The first work identifies bursts of activity 
in phone calls, using approximate counting with forgetting. The second presents a 
streaming optimization method to find the minimum of a function and its application 
to finding the hyper-parameter values that minimize the error. We believe that the 
two works reported here will have an impact on the work of other researchers. 

Fig. 4 Critical Difference Diagram comparing Self hyperparameter tuning, Grid hyperparameter 
tuning, and default parameters in 4 classification problems. 
Old and New Constraints in Model Based Clustering 


Luis A. García-Escudero, Agustín Mayo-Iscar, Gianluca Morelli, and Marco Riani 


Abstract Model-based approaches to cluster analysis and mixture modeling often 
involve maximizing classification and mixture likelihoods. Without appropriate 
constraints on the scatter matrices of the components, these maximizations result 
in ill-posed problems. Moreover, without constraints, non-interesting or "spurious" 
clusters are often detected by the EM and CEM algorithms traditionally used for 
the maximization of the likelihood criteria. A useful approach to avoid spurious 
solutions is to restrict the relative scatter of the components by a prespecified 
tuning constant. Recently, new methodologies for constrained parsimonious model-based 
clustering have been introduced which include, as limit cases, the 14 parsimonious 
models that are often applied in model-based clustering when assuming normal 
components. In this paper we first review the traditional approaches and then 
illustrate through an example the benefits of adopting the new constraints. 
1 Introduction 


Given a sample of observations $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^p$, a widely used 
method in unsupervised learning is to assume multivariate normal components and to 
adopt a maximum likelihood approach for clustering purposes. With this idea in mind, 
well-known classification and mixture likelihood approaches can be followed. 

In this work, we use $\phi(\cdot; \mu, \Sigma)$ to denote the probability density 
function of a p-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. 

In the classification likelihood approach we search for a partition 
$\{H_1, \ldots, H_k\}$ of the indices $\{1, \ldots, n\}$, centres $\mu_1, \ldots, \mu_k$ 
in $\mathbb{R}^p$, symmetric positive semidefinite $p \times p$ scatter matrices 
$\Sigma_1, \ldots, \Sigma_k$ and positive weights $\pi_1, \ldots, \pi_k$ with 
$\sum_{j=1}^{k} \pi_j = 1$, which maximize 

$$\sum_{j=1}^{k} \sum_{i \in H_j} \log\left(\pi_j \phi(x_i; \mu_j, \Sigma_j)\right). \qquad (1)$$

On the other hand, in the mixture likelihood approach, we seek the maximization 
of 

$$\sum_{i=1}^{n} \log\left(\sum_{j=1}^{k} \pi_j \phi(x_i; \mu_j, \Sigma_j)\right), \qquad (2)$$

with similar notation and conditions on the parameters as above. In this second 
approach, a partition into k groups can also be obtained from the fitted mixture 
model, by assigning each observation to the cluster-component with the highest 
posterior probability. 

Unfortunately, it is well known that the maximization of "log-likelihoods" like 
(1) and (2) without constraints on the $\Sigma_j$ matrices is a mathematically ill-posed 
problem [1, 2]. To see this unboundedness issue, we can just take $\mu_1 = x_1$, 
$\pi_1 > 0$ and $|\Sigma_1| \to 0$, which makes (2) diverge to infinity, and (1) also 
diverges with $H_1 = \{1\}$. 
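The unboundedness can be checked numerically. The following sketch is a univariate illustration (p = 1, k = 2) with made-up data and parameters, none of which come from the paper: the first component is centred on the first observation and its standard deviation is shrunk towards zero, while the second component is kept fixed.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(xs, weights, mus, sigmas):
    """Mixture log-likelihood, i.e. the target in (2) for p = 1."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, mus, sigmas)))
        for x in xs
    )


xs = [0.0, 1.0, 2.0, 3.0, 4.0]           # illustrative sample
logliks = [
    mixture_loglik(xs, [0.5, 0.5], [xs[0], 2.0], [s1, 1.5])
    for s1 in (1.0, 1e-2, 1e-4, 1e-8)    # first scatter shrinks to zero
]
```

The values in `logliks` keep increasing as the first scatter collapses onto a single observation, so the likelihood has no finite maximum unless the scatter parameters are constrained.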

This lack of boundedness can be solved by just focusing on local maxima of 
the likelihood target functions. However, many local maxima are often found and 
it is difficult to know which are the most interesting ones. See [3] for a detailed 
discussion of this issue. In fact, non-interesting local maxima denoted as "spurious" 
solutions, which consist of a few, almost collinear, observations, are often detected 
by the Classification EM algorithm (CEM), traditionally applied when maximizing 
(1), and by the EM algorithm, traditionally applied when maximizing (2). A recent 
review of approaches for dealing with this lack of boundedness and for reducing the 
detection of spurious solutions can be found in [4]. 

It is also common to enforce constraints on the $\Sigma_j$ scatter matrices when 
maximizing (1) or (2). Among them, the use of "parsimonious" models [5, 6] is one of 
the most popular and widely applied approaches in practice. These parsimonious 
models follow from a decomposition of the $\Sigma_j$ scatter matrices as 

$$\Sigma_j = \lambda_j \Omega_j \Gamma_j \Omega_j', \qquad (3)$$

with $\lambda_j = |\Sigma_j|^{1/p}$ (volume parameters), 

$$\Gamma_j = \mathrm{diag}(\gamma_{j1}, \ldots, \gamma_{jl}, \ldots, \gamma_{jp}) \quad \text{with} \quad \det(\Gamma_j) = \prod_{l=1}^{p} \gamma_{jl} = 1$$

(shape matrices), and $\Omega_j$ (rotation matrices) with $\Omega_j \Omega_j' = I_p$. 
Different constraints on the $\lambda_j$, $\Omega_j$ and $\Gamma_j$ elements are 
considered across components to get 14 parsimonious models (which are coded with a 
combination of three letters). These models notably reduce the number of free 
parameters to be estimated, thus improving efficiency and model interpretability. 
Moreover, many of them turn the constrained maximization of the likelihoods into 
well-defined problems and help to avoid spurious solutions. Unfortunately, the problems 
remain for models with unconstrained $\lambda_j$ volume parameters, which are coded 
with V as the first letter (V** models). Aside from relying on good initializations, 
it is common to stop the iterations early when approaching scatter matrices with very 
small eigenvalues or when detecting components accounting for a reduced number of 
observations. A not fully iterated solution (or no solution at all) is then returned 
in these cases. This idea is known, for instance, to be problematic when dealing with 
(well-separated) components made up of a few observations. 
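The volume, shape and rotation terms of decomposition (3) can be obtained from an eigendecomposition. The sketch below assumes a positive definite scatter matrix and uses NumPy's `numpy.linalg.eigh`; the function name is illustrative, and the ordering of the shape elements (ascending, as returned by `eigh`) is a convention of this sketch rather than of the paper.

```python
import numpy as np


def decompose_scatter(sigma):
    """Split a positive definite matrix into volume, shape and rotation."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    eigvals, omega = np.linalg.eigh(sigma)   # eigenvalues ascending, omega orthogonal
    lam = np.prod(eigvals) ** (1.0 / p)      # volume: |Sigma|^(1/p)
    gamma = np.diag(eigvals / lam)           # shape matrix with determinant 1
    return lam, gamma, omega


sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, gamma, omega = decompose_scatter(sigma)
rebuilt = lam * omega @ gamma @ omega.T      # recovers sigma up to rounding
```

For this matrix the eigenvalues are 1 and 3, so the volume parameter is the square root of 3 and the shape elements are 1/√3 and √3.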

Starting from a seminal paper by [7], an alternative approach is to constrain the 
$\Sigma_j$ scatter matrices by specifying some tuning constants that control the 
strength of the constraints. In this direction, the ratio between the largest and the 
smallest of the $k \times p$ eigenvalues of the $\Sigma_j$ matrices was forced to be 
smaller than a given fixed constant $c^* \geq 1$ [8, 9, 10, 11, 12]. This means that 
the maximization of (1) and (2) is done under the (simpler) constraint: 

$$\max_{j,l} \lambda_l(\Sigma_j) / \min_{j,l} \lambda_l(\Sigma_j) \leq c^*, \qquad (4)$$

where $\{\lambda_l(\Sigma_j)\}_{l=1}^{p}$ is the set of eigenvalues of the $\Sigma_j$ 
matrix, $j = 1, \ldots, k$. 

With this eigenvalue-ratio approach, we need a very high $c^*$ value to be close to 
affine equivariance. Unfortunately, such a high $c^*$ value does not always successfully 
prevent us from incurring spurious solutions. 


2 The New Constraints 


García-Escudero et al. [13] have recently introduced three different types of 
constraints on the $\Sigma_j$ matrices, which depend on three constants $c_{det}$, 
$c_{shw}$ and $c_{shb}$, all of them being greater than or equal to 1. 

The first type of constraint serves to control the maximal ratio among determinants 
and, consequently, the maximum allowed difference between component volumes: 

$$\text{"deter":} \quad \frac{\max_{j=1,\ldots,k} |\Sigma_j|}{\min_{j=1,\ldots,k} |\Sigma_j|} = \frac{\max_{j=1,\ldots,k} \lambda_j^p}{\min_{j=1,\ldots,k} \lambda_j^p} \leq c_{det}. \qquad (5)$$

The second type of constraint controls departures from sphericity "within" each 
component: 

$$\text{shape-"within":} \quad \frac{\max_{l=1,\ldots,p} \gamma_{jl}}{\min_{l=1,\ldots,p} \gamma_{jl}} \leq c_{shw} \quad \text{for } j = 1, \ldots, k. \qquad (6)$$

This provides a set of k constraints that in the most constrained case, $c_{shw} = 1$, 
imposes $\Gamma_1 = \ldots = \Gamma_k = I_p$, where $I_p$ is the identity matrix of 
size p, i.e., sphericity of components. 

Note that the new determinant-and-shape constraints (based on $c_{det} > 1$ and 
$c_{shw} = 1$) in (4) allow us to deal with spherical "heteroscedastic" cases, whereas the 
eigenvalue ratio constraint with $c^* = 1$ can only handle the spherical "homoscedastic" 
case. Constraints (5) and (6) were the basis for the "deter-and-shape" constraints in 
[14]. These two constraints alone resulted in mathematically well-defined constrained 
maximizations of the likelihoods in (1) and (2). However, although highly operative 
in many cases, they do not include, as limit cases, all the already mentioned 14 
parsimonious models. For instance, we may be interested in the same (or not very 
different) $\Gamma_j$ or $\Sigma_j$ matrices for all the mixture components and these 
cannot be obtained as limit cases from the "deter-and-shape" constraints. 

The third constraint serves to control the maximum allowed difference between 
shape elements "between" components: 

$$\text{shape-"between":} \quad \frac{\max_{j=1,\ldots,k} \gamma_{jl}}{\min_{j=1,\ldots,k} \gamma_{jl}} \leq c_{shb} \quad \text{for } l = 1, \ldots, p. \qquad (7)$$

This new type of constraint allows us to impose "similar" shape matrices for the 
components and, consequently, enforce $\Gamma_1 = \ldots = \Gamma_k$ in the most 
constrained $c_{shb} = 1$ case. 
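To make the four ratios concrete, the sketch below evaluates the left-hand sides of constraints (4) to (7) for a given set of scatter matrices. It is an illustration only: the function name and dictionary keys are hypothetical, and the shape elements of each component are assumed to be sorted in decreasing order before they are compared across components.

```python
import numpy as np


def constraint_ratios(sigmas):
    """Ratios controlled by c*, c_det, c_shw and c_shb for k scatter matrices."""
    # Rows: components; columns: eigenvalues in decreasing order (assumption).
    eig = np.array([np.linalg.eigvalsh(np.asarray(s, dtype=float))[::-1]
                    for s in sigmas])
    p = eig.shape[1]
    lam = np.prod(eig, axis=1) ** (1.0 / p)   # volume parameters
    gamma = eig / lam[:, None]                # shape elements, product 1 per row
    return {
        "eigen_ratio": eig.max() / eig.min(),                            # (4)
        "det_ratio": (lam.max() / lam.min()) ** p,                       # (5)
        "shape_within": (gamma.max(axis=1) / gamma.min(axis=1)).max(),   # (6)
        "shape_between": (gamma.max(axis=0) / gamma.min(axis=0)).max(),  # (7)
    }


ratios = constraint_ratios([np.diag([4.0, 1.0]), np.diag([16.0, 4.0])])
```

In this toy case the two components have identical shapes (shape-"between" ratio 1) and a shape-"within" ratio of 4, but they differ in volume, so both the determinant ratio and the eigenvalue ratio equal 16. A single eigenvalue-ratio constant cannot separate these three sources of heterogeneity, whereas the three new constants can.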


3 An Illustration Example of the New Constraints 


Figure 1 shows an example based on three groups. The data have been generated 
imposing equal determinants, $c_{det} = 1$, a noticeable departure from sphericity 
"within" each component, $c_{shw} = 40$, and a very moderate difference "between" the 
shape elements of the components, $c_{shb} = 1.3$. No constraint has been imposed on 
the rotation matrices. Finally, an average overlap of 0.10 has been imposed. These data 
sets have been generated through the MixSim method of [15], as extended by [16] and 
incorporated into the FSDA Matlab toolbox [17]. The overlap is defined as a sum of 
pairwise misclassification probabilities; see [16] for more details. 

The application of the traditional tclust approach with maximum ratio between 
eigenvalues ($c^*$) equal to 128 and $10^{10}$, respectively, produces the 
classifications shown in the left panels of Figure 2. In fact, the results in the top 
left panel would be exactly the same for any choice of $c^*$ within the interval 
[16, 128]. This means that a higher value of $c^*$ would apparently be needed to detect 
those two almost parallel clusters that were shown in Figure 1. However, choosing 
a greater value for $c^*$ may destroy the desired protection against spurious solutions 
provided by the constraints. For example, we see in the lower left panel how the 
choice $c^* = 10^{10}$ results in the detection of a spurious group consisting of a 
single observation. 

Fig. 1 An example with simulated data with 3 clusters in two dimensions. The average overlap is 
0.10. The data have been generated using equal determinants, moderate difference between shape 
elements "between" components and noticeable departure from sphericity "within" each component. 

The panels on the right, on the other hand, show the partitions resulting from the 
three new constraints imposed on the component covariance matrices. The top right 
panel shows the result of applying the three new restrictions with values of the tuning 
constants very close to the real values used to generate the dataset. We can see 
that, in this case, it is possible to recover the real structure of the data generating 
process. Moreover, the real cluster structure is also recovered in the lower right panel 
by choosing larger values of these tuning constants, though not so large as to allow 
the detection of spurious solutions. Some guidelines on how to choose these tuning 
constants can be found in [13]. 
Fig. 2 Comparison between the traditional (left panels) and new tclust procedure (right panels). 
Clustering Student Mobility Data in 3-way Networks 


Vincenzo Giuseppe Genova, Giuseppe Giordano, Giancarlo Ragozini, and Maria 
Prosperina Vitale 


Abstract The present contribution aims at introducing a network data reduction 
method for the analysis of 3-way networks in which classes of nodes of different 
types are linked. The proposed approach enables simplifying a 3-way network into 
a weighted two-mode network by considering the statistical concept of joint depen- 
dence in a multiway contingency table. Starting from a real application on student 
mobility data in Italian universities, a 3-way network is defined, where provinces of 
residence, universities and educational programmes are considered as the three sets 
of nodes, and occurrences of student exchanges represent the set of links between 
them. The Infomap community detection algorithm is then chosen for partitioning 
two-mode networks of students' cohorts to discover different network patterns. 


Keywords: 3-way network, complex network, community detection, mobility data, 
tertiary education 


Vincenzo Giuseppe Genova 
Department of Economics, Business, and Statistics, University of Palermo, Italy, 
e-mail: vincenzogiuseppe.genova@unipa.it 


Giuseppe Giordano 
Department of Political and Social Studies, University of Salerno, Italy, 
e-mail: ggiordano@unisa.it 


Giancarlo Ragozini 
Department of Political Science, Federico II University of Naples, Italy, 
e-mail: giragoz@unina.it 


Maria Prosperina Vitale (✉) 
Department of Political and Social Studies, University of Salerno, Italy, 
e-mail: mvitale@unisa.it 


© The Author(s) 2023 
P. Brito et al. (eds.), Classification and Data Science in the Digital Age, 
Studies in Classification, Data Analysis, and Knowledge Organization, 
https://doi.org/10.1007/978-3-031-09034-9_17 




1 Introduction 


Many complex relational data structures can be described as multimode or multiway 
networks in which nodes belonging to different modes are linked. The most common 
multimode network in social networks is represented by the affiliation network, 
where two-mode data, actors and events, form a bipartite graph divided into two 
groups [6]. In the case of tripartite networks, we deal with three types of nodes, and 
different graph structures can be defined. 

Although only a few papers deal with methods for these networks, in recent years, a 
growing number of works have appeared —especially in bipartite and tripartite cases— 
to disentangle the inherent complexity of such kinds of data structures. Looking at 
clustering and community detection algorithms proposed to partition a network into 
groups, we can identify some strands, all deriving from generalizations of methods 
suited for one-mode [19] and two-mode networks [2]. A classical approach consists 
of applying the usual community detection algorithms on a unique supra-adjacency 
matrix defined by combining all the possible two-mode networks in a block matrix 
[11, 15]. Alternative methods rely on projecting each two-mode network and on 
applying the usual community detection algorithms separately on these matrices 
[10]. In addition, there are methods adopting both an optimization procedure for 
3-way networks [16, 17, 14] by extending the idea of bipartite modularity [2], and 
an indirect blockmodeling approach by deriving a dissimilarity measure based on 
structural equivalence concept [3]. 

In our opinion, approaches that analyse the k modes through the collection of 
the k(k - 1)/2 two-mode networks [10] cannot take into account the statistical 
associations among all modes at the same time. Hence, the aim of this 
contribution is to present a network data reduction method based on the concept of 
joint dependence in a multiway contingency table [1]. 

Starting from real applications on the Italian student mobility phenomenon in 
higher education [12, 21, 7, 8, 13, 22], a 3-way network is defined, where provinces 
of residence, universities and educational programmes are considered as the three 
modes. Student mobility flows, measured in terms of occurrences, represent the set 
of links between them. Assuming that the statistical dependency between the set of 
nodes provinces of residence and the other two sets of nodes can be captured by 
the joined pair of nodes (universities and educational programmes), the tripartite 
network is transformed into a bipartite network, where the two modes are given by 
Italian provinces of residence (first mode) and the set of nodes given by all 
possible pairs of universities and educational programmes (second mode). Thus, with 
this network simplification, network indexes and clustering techniques for bipartite 
networks become available. Hence, the Infomap community detection algorithm [9, 4] is 
adopted to partition the derived network. 

The remainder of the paper is organized as follows. Section 2 presents the details 
of the proposed strategy of analysis, and the main results are reported from the 
analysis of student mobility data of Italian universities. Section 3 provides final 
remarks. 
2 Simplification of 3-way Networks 


In the present paper, the case of a tripartite network is considered as an example 
to show how the proposed network data simplification method works. In particular, 
we consider the real case study of student mobility paths in Italian universities. The 
MOBYSU.IT dataset¹ enables reconstruction of network data structures considering 
student mobility flows among territorial units and universities. 

More formally, given $V_P = \{p_1, \ldots, p_i, \ldots, p_I\}$, the set of $I$ provinces 
of residence; $V_U = \{u_1, \ldots, u_j, \ldots, u_J\}$, the set of $J$ Italian 
universities, and $V_E = \{e_1, \ldots, e_k, \ldots, e_K\}$, the set of $K$ educational 
programmes, a weighted tripartite 3-uniform hyper-graph $\mathcal{T}$ can be defined, 
consisting of a triple $(V, \mathcal{L}, W)$, with $V = \{V_P, V_U, V_E\}$ the collection 
of three sets of vertices, one for each mode, and 
$\mathcal{L} = \{\mathcal{L}_{PUE}\}$, $\mathcal{L}_{PUE} \subseteq V_P \times V_U \times V_E$, 
the collection of hyper-edges, with generic term $(p_i, u_j, e_k)$, which is the link 
joining the i-th province, the j-th university, and the k-th educational programme. 
Finally, $W$ is the set of weights, obtained by the function 
$w : \mathcal{L}_{PUE} \to \mathbb{N}$, and $w(p_i, u_j, e_k) = w_{ijk}$ is the number 
of students moving from a province $p_i$ towards a university $u_j$ in an educational 
programme $e_k$. Such a network structure can be described as a three-way array 
$\mathbf{A} = (a_{ijk})$, with $a_{ijk} = w_{ijk}$, and it has been called a 3-way 
network [3]. 

To deal with such a complex network structure and aiming at obtaining communities 
in which three modes are mixed, we wish to simplify the tripartite nature of 
the graph, without losing any significant information. In statistical terms, the array 
$\mathbf{A}$ can be interpreted as a 3-way contingency table, and then the statistical 
techniques to evaluate the association among variables (i.e. the modes) can be 
exploited [1]. Because a 3-way contingency table is a cross-classification of 
observations by the levels of three categorical variables, we are defining a network 
structure where the sets of nodes are the levels of the categorical variables. 
Specifically, we assume that if two modes are jointly associated (as are, by their 
very nature, universities and educational programmes) the tripartite network can be 
logically simplified into a bipartite one. In the student mobility network, we join the 
pairs of nodes in $V_U$ and in $V_E$, and then we deal with the relationships between 
these dyads and the nodes in $V_P$. 

Following this assumption, the sets of nodes $V_U$ and $V_E$ are put together into a 
set of joint nodes, namely $V_{UE}$. The tripartite network $\mathcal{T}$ can now be 
represented as a bipartite network $\mathcal{B}$ given by the triple 
$\{V^*, \mathcal{L}^*, W^*\}$, with $V^* = \{V_P, V_{UE}\}$. The set of hyper-edges 
$\mathcal{L}$ is thus simplified into a set of edges 
$\mathcal{L}^* = \{\mathcal{L}_{P,UE}\}$, $\mathcal{L}_{P,UE} \subseteq V_P \times V_{UE}$. 
The new edges $(p_i, (u_j; e_k))$ connect a province $p_i$ with an educational 
programme $e_k$ running in a given university $u_j$. The weights $W^*$ are the same as 
in the hyper-graph $\mathcal{T}$, i.e., $w^*_{i,jk} = w_{ijk}$. Note that the weights 
contained in the 3-way array $\mathbf{A}$ are preserved, but are now organized in a 
rectangular matrix $A$ of $I$ rows and $(J \times K)$ columns. 
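The reorganisation of the 3-way array into a rectangular matrix can be sketched with NumPy. The array below is filled with arbitrary numbers rather than real mobility flows, and the function names are illustrative; the only assumption is that the joint nodes are ordered university by university (row-major order), so that column j·K + k corresponds to the pair (u_j, e_k).

```python
import numpy as np


def flatten_three_way(array):
    """Reshape an I x J x K array into an I x (J*K) matrix, keeping all weights."""
    array = np.asarray(array)
    n_i, n_j, n_k = array.shape
    return array.reshape(n_i, n_j * n_k)


def column_to_pair(column, n_k):
    """Map a column index back to its (university, programme) pair."""
    return divmod(column, n_k)


# Toy example: I = 2 provinces, J = 3 universities, K = 4 programmes.
three_way = np.arange(24).reshape(2, 3, 4)
two_mode = flatten_three_way(three_way)
```

Here `two_mode[i, j * 4 + k]` equals `three_way[i, j, k]`, so no weight is lost; the resulting matrix can be read as the weighted biadjacency matrix of the bipartite network.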


¹ Database MOBYSU.IT [Mobilità degli Studi Universitari in Italia], research protocol MUR - 
Universities of Cagliari, Palermo, Siena, Torino, Sassari, Firenze, Cattolica and Napoli Federico II, 
Scientific Coordinator Massimo Attanasio (UNIPA), Data Source ANS-MUR/CINECA. 
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Taking advantage of this method, we aim to analyse weighted bipartite graphs 
adopting clustering methods. Among others, we use the Infomap community de- 
tection algorithm [9, 4] to study the flows' patterns in network structures instead 
of modularity optimization proposed in topological approaches [18, 5]. Indeed, the 
rationale of this algorithm —map equation- takes advantage of the duality between 
finding communities and minimizing the length —codelength— of a random walker's 
movement on a network. The partition with the shortest path length is the one that 
best captures the community structure in the bipartite data. Formally, the algorithm 
defines a module partition M of n vertices into m modules such that each vertex is 
assigned to one and only one module. The Infomap algorithm looks for the best M 
partition that minimizes the expected codelength, L(M), of a random walker, given 
by the following map equation: 


$$L(M) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{m} p^{i}_{\circlearrowright} H(\mathcal{P}^{i}) \qquad (1)$$

In equation (1), $q_{\curvearrowright} H(\mathcal{Q})$ represents the entropy of the movement between
modules weighted by the probability that the random walker switches modules on any
given step ($q_{\curvearrowright}$), and $\sum_{i=1}^{m} p^{i}_{\circlearrowright} H(\mathcal{P}^{i})$ is the entropy of movements within modules
weighted by the fraction of within-module movements that occur in module $i$, plus
the probability of exiting module $i$ ($p^{i}_{\circlearrowright}$), such that $\sum_{i=1}^{m} p^{i}_{\circlearrowright} = 1 + q_{\curvearrowright}$ [9].
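
To make the terms of the map equation concrete, the following minimal sketch evaluates L(M) for a given partition of a small undirected weighted graph. It is not the Infomap search itself, only the objective it minimizes. It assumes a random walker without teleportation, so that node visit rates are proportional to node strength; the function names and the toy graph (two triangles joined by one edge) are hypothetical.

```python
import math
from collections import defaultdict

def plogp(p):
    """p * log2(p), with the convention 0 * log2(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def map_equation(edges, module_of):
    """Two-level codelength L(M) for an undirected weighted graph.

    edges: list of (u, v, weight); module_of: dict node -> module id.
    Returns (L, q_exit_total, p_circ) with p_circ[i] = q_i + visit rate of module i.
    """
    strength = defaultdict(float)
    exit_w = defaultdict(float)
    total = 0.0
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        total += 2 * w
        if module_of[u] != module_of[v]:  # edge crossing a module boundary
            exit_w[module_of[u]] += w
            exit_w[module_of[v]] += w
    p = {a: s / total for a, s in strength.items()}   # stationary visit rates
    modules = set(module_of.values())
    q = {i: exit_w[i] / total for i in modules}       # per-module exit probabilities
    q_total = sum(q.values())

    # Entropy of movements between modules, weighted by q_total.
    index_term = 0.0
    if q_total > 0:
        index_term = -q_total * sum(plogp(qi / q_total) for qi in q.values())

    # Entropy of movements within each module, weighted by p_circ[i].
    module_term, p_circ = 0.0, {}
    for i in modules:
        p_in = [p[a] for a in p if module_of[a] == i]
        p_circ[i] = q[i] + sum(p_in)
        h = -plogp(q[i] / p_circ[i]) - sum(plogp(pa / p_circ[i]) for pa in p_in)
        module_term += p_circ[i] * h
    return index_term + module_term, q_total, p_circ

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
two_modules = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_module = {node: 0 for node in range(6)}
L_two, q_two, p_circ_two = map_equation(edges, two_modules)
L_one, q_one, p_circ_one = map_equation(edges, one_module)
```

On this toy graph the two-module partition yields a shorter codelength than the single-module one, which is the sense in which a shorter description captures community structure.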

In our case, the Infomap algorithm is adopted to discover communities of students
characterized by similar mobility patterns. Indeed, when analysing mobility data, where
links represent patterns of student movement among territorial units and universities,
flow-based approaches are likely to identify the most important features. Finally, to
focus only on relevant student flows in our student mobility network, a filtering
procedure is adopted by considering the Empirical Cumulative Distribution Function
(ECDF) of the link weights.


2.1 Main Findings 


Students' cohorts enrolled in Italian universities in four academic years (a.y.),
2008-09, 2011-12, 2014-15, and 2017-18, are analysed. The number of nodes for the sets
$V_P$ (107 provinces), $V_U$ (79-80 universities), and $V_E$ (45 educational programmes),
and the number of students involved in the four cohorts are quite stable over time
(Table 1). Furthermore, the percentage of movers (i.e., students enrolled in a
university outside of their region of residence) increased from 16.4% in the a.y. 2008-09
to 20.6% in the a.y. 2017-18, and it is higher for males than females.


Table 1 Percentage of students according to their mobility status by cohort and gender.


                                  Mover status
Cohort    Gender   N          Stayers %   Movers %

2008-09   F        136,381    84.2        15.8
          M        106,950    82.8        17.2
          Total    243,331    83.6        16.4

2011-12   F        126,606    81.7        18.3
          M        102,479    80.9        19.1
          Total    229,085    81.0        19.0

2014-15   F        121,121    80.5        19.5
          M        102,358    80.4        19.6
          Total    223,479    80.5        19.5

2017-18   F        134,315    79.1        20.9
          M        113,496    79.8        20.2
          Total    247,811    79.4        20.6


Following the network simplification approach, the tripartite networks (one for
each cohort) are simplified into bipartite networks, and the four ECDFs of the link
weights are considered to filter relevant flows. The distributions suggest that more
than 50% of links between pairs of nodes have weights equal to 1 (i.e., flows of only
one student), and about 95% of links carry single-digit flows. Thus, only the networks
restricted to links with a weight greater than or equal to 10 are further
analysed.
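
The ECDF-based filtering step can be sketched as below. The link list, the node names, and the two helper functions are hypothetical; only the threshold of 10 comes from the text.

```python
import numpy as np

def ecdf_at(weights, x):
    """Empirical CDF of the link weights evaluated at x: share of links with weight <= x."""
    return float(np.mean(np.asarray(weights) <= x))

def filter_links(links, threshold=10):
    """Keep only (source, target, weight) triples with weight >= threshold."""
    return [(s, t, w) for s, t, w in links if w >= threshold]

# Toy weighted bipartite links: province -> (university, programme) joint node.
links = [
    ("p1", "u1-e1", 1),
    ("p1", "u2-e3", 1),
    ("p2", "u1-e1", 4),
    ("p2", "u3-e2", 12),
    ("p3", "u3-e2", 25),
]
weights = [w for _, _, w in links]
share_single_student = ecdf_at(weights, 1)   # share of one-student flows
share_single_digit = ecdf_at(weights, 9)     # share of single-digit flows
backbone = filter_links(links, threshold=10)
```

Inspecting the ECDF at 1 and at 9 mirrors the two shares reported in the text, and the retained links form the network passed to community detection.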

To reveal groups of universities and educational programmes attracting students,
the Infomap community detection algorithm is applied. Looking at Table 2, we
notice a reduction of the number of communities from the first to the last student
cohort, suggesting a sort of stabilization in the trajectories of movers towards brand
universities of the center-north, together with an increase in north-north mobility [20],
and a relevant dichotomy between scientific and humanistic educational programmes.
Network visualizations by groups (Figures 1 and 2) confirm that the most attractive
universities are located in the north of Italy, especially for educational programmes
in economics and engineering (the Bocconi University, the Polytechnic of Turin and
the Cattolica University).


Table 2 Number of communities, codelength, and relative saving codelength per cohort.


Cohort    Communities   Codelength   Relative saving codelength

2008-09   14            0.96         83%
2011-12   17            1.72         70%
2014-15    3            5.23         12%
2017-18    3            1.00         83%


Fig. 1 Network visualization by groups, student cohort a.y. 2008-09.


Fig. 2 Network visualization by groups, student cohort a.y. 2017-18.


3 Concluding Remarks


The proposed network simplification strategy on tripartite graphs defined for student
mobility data provides interesting insights into the phenomenon under analysis. The
main attractive destinations remain the northern universities, for educational
programmes such as engineering and business. Besides the well-known south-to-north
route, other interregional routes in the northern area appear. In addition, the
reduction in the number of communities suggests a sort of stabilization in terms of
the mobility routes of movers towards brand universities, highlighting student university
destination choices close to the labor market demand.

Hyper-graphs and multipartite networks still remain very active areas for research 
and challenging tasks for scholars interested in discovering the complexities underly- 
ing these kinds of data. Specific tools for such complex network structures should be 
designed combining network analysis and other statistical techniques. As future lines 
of research, the comparison of community detection algorithms that better represent 
the structural constraints of the phenomena under analysis and the assessment of 
other backbone approaches to filter the significant links will be developed.


Clustering Brain Connectomes Through a
Density-peak Approach


Riccardo Giubilei 


Abstract The density-peak (DP) algorithm is a mode-based clustering method that 
identifies cluster centers as data points being surrounded by neighbors with lower 
density and far away from points with higher density. Since its introduction in 2014, 
DP has reaped considerable success for its favorable properties. A striking advantage 
is that it does not require data to be embedded in vector spaces, potentially enabling 
applications to arbitrary data types. In this work, we propose improvements to 
overcome two main limitations of the original DP approach, i.e., the unstable density 
estimation and the absence of an automatic procedure for selecting cluster centers. 
Then, we apply the resulting method to the increasingly important task of graph 
clustering, here intended as gathering together similar graphs. Potential implications 
include grouping similar brain networks for ability assessment or disease prevention, 
as well as clustering different snapshots of the same network evolving over time to 
identify similar patterns or abrupt changes. We test our method in an empirical 
analysis whose goal is clustering brain connectomes to distinguish between patients 
affected by schizophrenia and healthy controls. Results show that, in the specific 
analysis, our method outperforms many existing competitors for graph clustering. 


Keywords: nonparametric statistics, mode-based clustering, networks, graph clus- 
tering, kernel density estimation 


1 Introduction 


Clustering is the task of grouping elements from a set in such a way that elements 
in the same group, also defined as cluster, are in some sense similar to each other, 
and dissimilar to those from other groups. Mode-based clustering is a nonparametric 
approach that works by first estimating the density, and then identifying in some
way its modes and the corresponding clusters. An effective method to find modes
and clusters is through the density-peak (DP) algorithm [12], which has drawn 
considerable attention since its introduction in 2014. One of the striking advantages 
of DP is that it does not require data to be embedded in vector spaces, implying that 
it can be applied to arbitrary data types, provided that a proper distance is defined. 
In this work, we focus on its application to clustering graph-structured data objects. 

The expression graph clustering can refer either to within-graph clustering or 
to between-graph clustering. In the first case, the elements to be grouped are the 
vertices of a single graph; in the second, the objects are distinct graphs. Here, graph 
clustering is intended as between-graph clustering. Between-graph clustering is an 
emerging but increasingly important task due to the growing need of analyzing and 
comparing multiple graphs [10, 4]. Potential applications include clustering: brain 
networks of different people for ability assessment, disease prevention, or disease 
evaluation; online social ego networks of different users to find people with similar 
social structures; different snapshots of the same network evolving over time to 
identify similar patterns, cycles, or abrupt changes. 

Heretofore, the task of between-graph clustering has not been exhaustively in- 
vestigated in the literature, implying a substantial lack of well-established methods. 
The goal of this work is to improve and adapt the density-peak algorithm to define a 
fairly general method for between-graph clustering. For validation and comparison 
purposes, the resulting procedure and its main competitors are applied to grouping 
brain connectomes of different people to distinguish between patients affected by 
schizophrenia and healthy controls. 


2 Related Work 


Existing techniques for between-graph clustering can be divided into two main 
categories: 1) transforming graph-structured data objects into Euclidean feature 
vectors in order to apply standard clustering algorithms; 2) using the distances 
between the original graphs in distance-based clustering methods. 

The most common technique within the first category is the use of classical 
clustering techniques on the vectorized adjacency matrices [10]. Nonetheless, more 
advanced numerical summaries have been proposed to better capture the structural 
properties of the graphs and to decrease feature dimensionality. Examples include: 
shell distribution [1], traces of powers of the adjacency matrix [10], and graph 
embeddings such as graph2vec [11]; see [4] for a longer list. Techniques from the 
first category share an important drawback: the transformation into feature vectors 
necessarily implies loss of information. Additionally, methods for extracting features 
may be domain-specific. 

The second category features Partitioning Around Medoids (PAM) [7], or k- 
medoids, which finds representative observations by iteratively minimizing a cost 
function based on the distances between data objects, and assigns other observations 
to the closest medoid. PAM's main limitations are that it requires the number of
clusters in advance and can only identify convex-shaped groups. Density-based
spatial clustering of applications with noise [3], or DBSCAN, overcomes these two 
constraints by computing the density of data points starting from their distances, 
and defining clusters as samples of high density that are close to each other (and 
surrounded by areas of lower density). A similar approach is the DP, which is 
described in greater detail in Section 3.1. Alternatively, hierarchical clustering can 
be applied to distances between graphs, as in [13], where a spectral Laplacian-based 
distance is proposed and used. Finally, k-groups [8] is a clustering technique within 
the Energy Statistics framework [14] where the goal is minimizing the total within- 
cluster Energy distance, which is computed starting from the distances between 
original observations. 


3 Methods 


In this section, we first describe the original DP approach; then, we introduce the 
DP-KDE method, which is partly named after Kernel Density Estimation; finally, 
we discuss how to employ it for graph clustering. 


3.1 Original DP 


The density-peak algorithm [12] is based on a simple idea: since cluster centers are 
identified as the distribution's modes, they must be 1) surrounded by neighbors with 
lower density, and 2) at a relatively large distance from points with higher density. 
Consequently, two quantities are computed for each observation $x_i$: the local density
$\rho_i$, and the minimum distance $\delta_i$ from other data points with higher density. The
local density $\rho_i$ of $x_i$ is defined as:


$$\rho_i = \sum_{j} I(d_{ij} < d_c), \qquad (1)$$


where $I(\cdot)$ is the indicator function, $d_{ij} = d(x_i, x_j)$ is the distance between $x_i$ and
$x_j$, and $d_c$ is a cutoff distance. In simple terms, $\rho_i$ is the number of points that are
closer than $d_c$ to $x_i$. The DP algorithm is robust with respect to $d_c$, at least with large
datasets [12]. Once the density is computed, the definition of the minimum distance
$\delta_i$ between point $x_i$ and any other point $x_j$ with higher density is straightforward:


$$\delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij}). \qquad (2)$$


By convention, the point with highest density has $\delta_i = \max_j (d_{ij})$. The interpretation
of $\delta_i$ reflects the algorithm's core idea: data points that are not local or global maxima
have their $\delta_i$ constrained by other points within the same cluster, hence cluster centers
have large values of $\delta_i$. However, this is not sufficient: they also need to have a large $\rho_i$
because otherwise the point could be merely distant from any other. After identifying
cluster centers, other observations are assigned to the same cluster as their nearest
neighbor of higher density.
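
The steps above can be sketched on a precomputed distance matrix as follows. This is a minimal illustration, not the reference implementation of [12]: density ties are broken by sort order, and centers are taken as the top k values of the product of density and distance instead of being read off a decision graph; both choices are assumptions made here for the sake of a runnable example.

```python
import numpy as np

def density_peak(D, d_c, k):
    """Sketch of density-peak clustering on an (n, n) symmetric distance matrix D.

    d_c is the cutoff distance and k the number of centers (assumed given).
    Returns (rho, delta, centers, labels).
    """
    n = D.shape[0]
    # Equation (1), excluding the point itself (a constant shift of every rho).
    rho = (D < d_c).sum(axis=1) - 1
    order = np.argsort(-rho, kind="stable")     # points by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = D[i].max()               # convention for the global peak
            continue
        higher = order[:rank]                   # ties resolved by sort order
        j = higher[np.argmin(D[i, higher])]
        delta[i] = D[i, j]                      # Equation (2)
        nearest_higher[i] = j
    gamma = rho * delta
    gamma[order[0]] = np.inf                    # the global peak is always a center
    centers = np.argsort(-gamma, kind="stable")[:k]
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)
    for i in order:                             # parents are labelled before children
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return rho, delta, centers, labels

# Toy example: two well-separated groups on a line.
x = np.array([0.0, 1.0, 2.0, 50.0, 51.0, 52.0, 53.0])
D = np.abs(x[:, None] - x[None, :])
rho, delta, centers, labels = density_peak(D, d_c=1.5, k=2)
```

Because only distances enter the computation, the same function applies to any data type for which a distance matrix can be built.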

The density-peak algorithm has many favorable properties: it manages to detect
nonspherical clusters; it does not require the number of clusters in advance or data to
be embedded in vector spaces; it is computationally fast because it does not explicitly
maximize each data point's density field and it performs cluster assignment in a single
step; it estimates a clear population quantity; and it has only one tuning parameter
(the cutoff distance $d_c$).


3.2 DP-KDE 


The density-peak approach also has drawbacks. Over the last few years, many articles 
have proposed improvements to overcome two main critical points: the unstable 
density estimation and the absence of an automatic procedure for selecting cluster 
centers. In this work, we explicitly tackle these two aspects. 

The unstable density estimation induced by Equation (1) has been widely shown
[9, 16, 15]. Although many solutions have been proposed, we espouse the research
line suggesting the use of Kernel Density Estimation (KDE) to compute $\rho_i$ [9, 15]:


$$\rho_i = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x_i - x_j}{h}\right). \qquad (3)$$


In Equation (3), $h$ is the bandwidth, which is a smoothing parameter, and $K(\cdot)$ is
the kernel, which is a non-negative function weighting the contribution of each data
point to the density of the $i$-th observation. We use the Epanechnikov kernel, which
is normalized, symmetric, and optimal in the Mean Square Error sense [2]:


$$K(u) = \begin{cases} \frac{3}{4}\,(1 - u^2), & |u| < 1 \\ 0, & |u| \geq 1. \end{cases} \qquad (4)$$


Equation (4) implies a null contribution of observation $j$ to the $i$-th density whenever
$|(x_i - x_j)/h| \geq 1$, while, in the opposite case, it results in a positive weight depending
quadratically on $(x_i - x_j)/h$. Consequently, $h$ may be regarded as the cutoff distance
for the DP-KDE method.
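
A minimal sketch of the Epanechnikov kernel and of the resulting kernel density estimate for a univariate sample is given below. The function names and the toy sample are hypothetical, and the standard univariate KDE normalization 1/(nh) is assumed.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 3/4 * (1 - u^2) for |u| < 1, and 0 otherwise."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def kde_density(x, h):
    """Density at each sample point: 1/(n h) * sum_j K((x_i - x_j) / h)."""
    x = np.asarray(x, dtype=float)
    u = (x[:, None] - x[None, :]) / h
    return epanechnikov(u).sum(axis=1) / (len(x) * h)

sample = [0.0, 0.5, 3.0]
rho = kde_density(sample, h=1.0)
```

Points farther than h from the i-th observation contribute nothing to its density, which is why h plays the role of the cutoff distance, while closer points are weighted smoothly instead of being merely counted.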

The automatic selection of cluster centers involves many aspects: the cutoff dis- 
tance, the number of clusters, and which data points to select. In the following, we 
use a cutoff distance h such that the average number of neighbors is between 1 and 
2% of the sample size, as suggested by [12]. The number of clusters k is here con- 
sidered as a given parameter, leaving the search for its optimal value for future work. 
Finally, the method for selecting data points as cluster centers is obtained by refining
an intuition contained in [12], where candidates are observations with sufficiently
large values of $\gamma_i = \delta_i \rho_i$. However, this quantity has two major drawbacks: first, if
$\delta_i$ and $\rho_i$ are not defined over the same scale, results could be misleading; second, it
implicitly assumes that $\delta_i$ and $\rho_i$ shall be given the same weight in the decision. We
overcome these two limitations by first normalizing both $\delta_i$ and $\rho_i$ between 0 and
1, and then giving them different weights that are based on their informativeness.
We measure the latter using the Gini coefficient of the two (normalized) quantities,
under the assumption that the least concentrated distribution between the two is the
most informative. Specifically, each observation is given a measure of importance
that is defined as:


$$\gamma_i^{G} = \delta_{01,i}^{\,G(\delta_{01})} \cdot \rho_{01,i}^{\,G(\rho_{01})}, \qquad (5)$$


where $\delta_{01}$ and $\rho_{01}$ are the normalized versions of $\delta$ and $\rho$ respectively, $\delta_{01,i}$ and $\rho_{01,i}$
are the corresponding $i$-th values, and $G(x)$ denotes the Gini coefficient of $x$. Then,
the selected cluster centers are the top $k$ observations in terms of $\gamma_i^{G}$. Assigning
observations to the same cluster as their nearest neighbor of higher density is what
concludes the DP-KDE method.
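
The center-selection score can be sketched as follows. The exact functional form is an assumption: the sketch reads Equation (5) as the product of the two min-max normalized quantities, each raised to its own Gini coefficient. The helper names and the toy values are hypothetical.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (0 means perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ x / (n * x.sum()))

def minmax(x):
    """Rescale to [0, 1]; a constant vector maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def gamma_gini(delta, rho):
    """Assumed reading of Equation (5): normalized delta and rho, Gini-weighted."""
    d01, r01 = minmax(delta), minmax(rho)
    return d01 ** gini(d01) * r01 ** gini(r01)

delta = [5.0, 0.1, 0.1, 4.0]
rho = [3.0, 2.0, 1.0, 3.0]
scores = gamma_gini(delta, rho)
top_two = np.argsort(-scores)[:2]   # candidate centers for k = 2
```

Normalizing first removes the dependence on the scales of the two quantities, and the exponents let the data decide how much each of them counts in the ranking.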


3.3 Graph Clustering 


A graph is a mathematical object composed of a collection of vertices linked by
edges between them. Formally, a graph is denoted with $G = (V, E)$, where $V$ is
the set of vertices and $E$ is the set of edges. If $e \in E$ joins vertices $u, v \in V$, i.e.,
$e = \{u, v\}$, then $u$ and $v$ are adjacent or neighbors. The number of edges incident with
any vertex $v$ is the degree of $v$. Each edge $e \in E$ is represented through a numerical
value $w_e$ called edge weight: if weights are equal to 1 for all and only the existent
edges, and 0 for the others, $G$ is unweighted; when existent edges have real-valued
weights, $G$ is weighted. If $w_{\{u,v\}} = w_{\{v,u\}}$ for all $u, v \in V$, the graph $G$ is undirected,
otherwise, it is directed. The entire information about $G$'s connectivity is stored in
a $|V| \times |V|$ adjacency matrix $A$ whose generic entry in the $u$-th row and $v$-th column
is $w_e$, where $e = \{u, v\}$ and $u, v \in V$.

The DP-KDE method can be used for graph clustering if a proper distance between 
graphs is defined. In this work, we employ the Edge Difference Distance [6], which is 
defined as the Frobenius norm of the difference between the two graphs' adjacency 
matrices. The choice is motivated by many factors: a flexible definition that can 
be directly applied also to directed and weighted graphs, the reasonable results it 
yields when node correspondence is a concern, and its limited computational time 
complexity. Formally, the Edge Difference Distance between two graphs $x_i$ and $x_j$
is defined as:


$$d_{ED}(x_i, x_j) = \| A^i - A^j \|_F = \sqrt{\sum_{p} \sum_{q} \left| A^i_{p,q} - A^j_{p,q} \right|^2}, \qquad (6)$$


where $A^i$ and $A^j$ are the adjacency matrices of $x_i$ and $x_j$ respectively, and $\| \cdot \|_F$
denotes the Frobenius norm.

Consequently, the two fundamental quantities of the DP-KDE method are computed
in the following as:


$$\rho_i \propto \sum_{j=1}^{n} K\left(\frac{d_{ED}(x_i, x_j)}{h}\right), \qquad (7)$$



where $K(\cdot)$ is the Epanechnikov kernel defined in Equation (4) and the normalizing
constant is omitted because we are simply interested in the ranking between the
densities, and:


$$\delta_i = \min_{j:\, \rho_j > \rho_i} \left( d_{ED}(x_i, x_j) \right). \qquad (8)$$

Finally, cluster centers are selected as the observations with the largest values
of $\gamma_i^{G}$ as defined in Equation (5), and other observations are assigned to the same
cluster as their nearest neighbor of higher density, i.e., the observation attaining $\delta_i$.
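
Equations (6) to (8) can be sketched together for a list of graphs given as adjacency matrices over the same node set. The function names, the toy graphs, and the bandwidth value are hypothetical; densities are left unnormalized, since only their ranking matters.

```python
import numpy as np

def edge_difference_distance(A_i, A_j):
    """Frobenius norm of the difference between two adjacency matrices."""
    diff = np.asarray(A_i, dtype=float) - np.asarray(A_j, dtype=float)
    return float(np.linalg.norm(diff, ord="fro"))

def pairwise_distances(graphs):
    """Symmetric matrix of Edge Difference Distances between all pairs of graphs."""
    n = len(graphs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = edge_difference_distance(graphs[i], graphs[j])
    return D

def epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def rho_delta(D, h):
    """Unnormalized kernel density and minimum distance to a denser graph."""
    rho = epanechnikov(D / h).sum(axis=1)
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = np.flatnonzero(rho > rho[i])
        # Convention: the densest observation gets its maximum distance.
        delta[i] = D[i, higher].min() if higher.size else D[i].max()
    return rho, delta

# Toy weighted, undirected graphs on two nodes (one edge weight each).
graphs = [np.array([[0.0, w], [w, 0.0]]) for w in (1.0, 1.5, 2.0, 9.0)]
D = pairwise_distances(graphs)
rho, delta = rho_delta(D, h=1.0)
```

In this toy example the second graph lies between the first and the third, so it has the highest density, while the fourth graph is isolated and has a large minimum distance but a low density.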


4 Empirical Analysis 


The DP-KDE method for graph clustering is employed in an unsupervised empirical 
analysis where the ground truth is known, and its performance is compared in terms 
of accuracy both with natural competitors and with a method treating the problem 
as supervised. The ultimate goal is clustering brain connectomes, one for each 
individual, correctly distinguishing between patients affected by schizophrenia (SZ) 
and healthy controls. 

We use publicly available¹ data from a recent study [5] whose aim is finding
relevant links between Regions of Interest (ROIs) for predicting schizophrenia from 
multimodal brain connectivity data. The cohort is composed of 27 schizophrenic 
patients and 27 age-matched healthy participants acting as control subjects. In the 
current work, we focus only on this cohort's functional Magnetic Resonance Imaging 
(fMRI) connectomes. Functional connectivity matrices have been computed starting 
from fMRI scans, treating them as time series, and computing Pearson's correlation 
coefficient between time series for distinct ROIs. The resulting matrices are weighted, 
undirected, and made of 83 nodes. 

The aforementioned study [5] treats every functional connectivity matrix as a
single multivariate realization of $(83 \cdot 82)/2 = 3403$ numeric variables, each
representing a connection between two of the 83 ROIs. The authors reduce feature
dimensionality by performing Recursive Feature Elimination based on Support Vector Machines
(SVM-RFE), and tackle the classification problem as supervised using 20 repetitions
of nested 5-fold cross-validation. When using only functional connectivity data, they
achieve an average accuracy of 68.28%² over the resulting 100 test sets.
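
The count of 3403 variables corresponds to the strictly upper triangular part of an 83 x 83 symmetric matrix. A minimal sketch of this vectorization is shown below; the random time series stand in for real fMRI signals and are purely illustrative, as is the function name.

```python
import numpy as np

def vectorize_connectome(A):
    """Stack the entries above the main diagonal of a symmetric matrix into a vector."""
    A = np.asarray(A)
    rows, cols = np.triu_indices(A.shape[0], k=1)
    return A[rows, cols]

n_rois = 83
rng = np.random.default_rng(0)
signals = rng.normal(size=(n_rois, 200))   # toy time series, one row per ROI
A = np.corrcoef(signals)                   # Pearson correlation between ROIs
features = vectorize_connectome(A)
```

Each entry of `features` is one ROI-to-ROI correlation, so the graph structure is flattened into an ordinary feature vector.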


¹ https://doi.org/10.5281/zenodo.3758534.


² This exact figure is not included in the article, but the analysis is fully reproducible since the authors
made their source code available at https://github.com/leoguti85/BiomarkersSCHZ.


The approach we adopt in this work is rather different. First, graphs are analyzed
in their original form, without any simplification to numeric variables, resulting in 
only one graph-structured variable. Observations are 54, each one representing the 
functional connectome of a different individual. We tackle the problem with an un- 
supervised classification approach seeking to cluster connectomes into two groups: 
schizophrenic and healthy. To this end, we use the DP-KDE method for graph clustering
described in Section 3.3. Starting from the 54 connectomes, each observation's
local density $\rho_i$ and minimum distance $\delta_i$ are computed using Equations (7) and
(8), respectively. The centers of the two clusters are the two observations whose $\gamma_i^{G}$
is largest. Then, other observations are assigned to the same cluster as their nearest
neighbor of higher density. Finally, the clustering performance is evaluated by comparing
the algorithm's assignment to the ground truth. The DP-KDE method achieves an
accuracy of 70.37%, which is more than 2 percentage points higher than the one obtained in [5].
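
Since cluster labels are arbitrary, comparing an assignment with the ground truth requires matching clusters to classes first. The sketch below assumes that accuracy is taken under the best one-to-one matching and that there are no more clusters than classes; the function name and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(truth, clusters):
    """Share of observations correctly labelled under the best cluster-to-class matching."""
    classes = sorted(set(truth))
    cluster_ids = sorted(set(clusters))
    best = 0
    for perm in permutations(classes, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        hits = sum(mapping[c] == t for c, t in zip(clusters, truth))
        best = max(best, hits)
    return best / len(truth)

truth = ["SZ", "SZ", "SZ", "HC", "HC", "HC"]
clusters = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(truth, clusters)
```

With 54 observations, an accuracy of 70.37% is consistent with 38 connectomes being assigned to the correct group.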

Table 1 includes the performance in terms of accuracy of both the DP-KDE 
and the SVM-RFE methods, as well as that of other graph clustering competitors. 
Specifically, we consider: the classical DP algorithm on the original data objects, 
with the same cutoff distance as in DP-KDE and manually selected cluster centers; 
k-means clustering on the 3403 numeric variables obtained from vectorizing the 
adjacency matrices; DBSCAN on the original data objects, with parameters $\varepsilon = 20.2$
and 15 as the minimum number of points required to form a dense region; PAM and 
k-groups on the original data objects. In all these cases, the number of clusters has 
been kept fixed to k = 2. The method that yields the best accuracy in the specific 
problem is the DP-KDE. 


Table 1 Accuracy (%) for DP-KDE and some of its possible competitors.


Method         DP-KDE   SVM-RFE   DP      k-means   DBSCAN   PAM     k-groups

Accuracy (%)   70.37    68.28     62.96   62.96     61.11    62.96   62.96


5 Concluding Remarks 


After explaining the importance of graph clustering and briefly reviewing some 
existing methods to perform this task, we have considered the possibility of adopting 
a density-peak approach. We have improved the original DP algorithm by using 
a more robust definition of the density $\rho_i$, and by automatically selecting cluster
centers based on the quantity $\gamma_i^{G}$ we have introduced. We have also selected a proper
distance between graphs, namely, the Edge Difference Distance. Finally, we have 
used the resulting method in an empirical analysis with the goal of clustering brain 
connectomes to distinguish between schizophrenic patients and healthy controls. 
Our method outperforms another one that treats the specific task as supervised, and it
achieves the highest accuracy among the graph clustering competitors considered
(70.37% against at most 62.96%).

An initial idea for future work is the search for the optimal number of clusters.
This may be achieved either by fixing a threshold for $\gamma_i^{G}$ or by selecting all the data
points after the largest increase in terms of $\gamma_i^{G}$. The cutoff distance could also be
tuned, possibly maximizing in some way the dispersion of points in the bivariate
distribution of $\rho$ and $\delta$. Then, the DP-KDE method needs to be extended beyond the
univariate case. Finally, other distances between graphs could be considered to better
reflect alternative application-specific needs, e.g., when graphs are not defined over
the same set of nodes.


Similarity Forest for Time Series Classification


Tomasz Gorecki, Maciej Łuczak, and Paweł Piasecki 


Abstract The idea of similarity forest comes from Sathe and Aggarwal [19] and is
derived from random forest. Over its 20 years of existence, random forest has
proved to be one of the best-performing methods, showing top performance across a
vast array of domains while remaining simple, time-efficient, and interpretable.
However, its usage is limited to multidimensional data. Similarity
forest does not require such a representation: it only needs similarities
between observations to be computed. Thus, it may be applied to data for which a
multidimensional representation is not available. In this paper, we propose an
implementation of similarity forest for time series classification. We investigate two
distance measures, Euclidean and dynamic time warping (DTW), as the underlying measure
for the algorithm. We compare the performance of similarity forest with 1-nearest neighbor
and random forest on the UCR (University of California, Riverside) benchmark
database. We show that similarity forest with DTW, taking into account mean ranks,
outperforms the other classifiers. The comparison is enriched with statistical analysis.


1 Introduction


Time series classification is a rapidly developing research field that has gained much
attention from researchers and business during the last two decades, apparently because
more and more of the data around us is located in the time domain and thus fulfils
the definition of time series. Predictive maintenance [18], quality
monitoring [22], stock market analysis [20] and sales forecasting [17] are just a few
examples of present-day problems where time series appear. The reason why
we usually apply different methods to time series than to regular (non-time series) data
is that time series are ordered in time (or in some other space with ordering),
and it is beneficial to use the information conveyed by the ordering.

In recent years, one could observe many advances in the field of time series
classification. In 2017, Bagnall et al. presented a comprehensive comparison of time
series classification algorithms [2], showing that, although there are dozens of far
more complex methods, 1-Nearest Neighbour (1NN) [6, 11] coupled with the DTW [3]
distance constitutes a good baseline. In fact, it has been outperformed by several
classifiers, with Collective of Transformation Ensembles (COTE) [1] as the most
effective one. Furthermore, COTE was extended with a Hierarchical Vote system, first
to HIVE-COTE [13] and then to HIVE-COTE 2.0 [15], the current state-of-the-art
classifier for time series. In general, the success of COTE-family classifiers
is based on the observation that, in the case of time series, it is highly beneficial
to use different data representations. For example, HIVE-COTE 1.0 utilizes five
ensembles based on different data transformation domains. However, a common
criticism of such an approach is its time complexity. In the case of HIVE-COTE,
it equals $O(n^2 l^4)$, where $n$ is the number of observations and $l$ is the length of the series.
Another drawback, especially significant for practitioners, is the complex structure
of the model ensembles, which makes it hard to use HIVE-COTE without spending a
decent amount of time studying its components beforehand.

An alternative to such complex models may be to accept possibly
slightly worse performance in favour of model simplicity and reduced computation
time. A group of classifiers that seems to hold great potential are those inspired
by Random Forest (RF) [4]. This already 20-year-old algorithm remains at the
forefront of classifiers, showing extremely good performance and robustness across
multiple domains. Fernandez-Delgado et al. [10] performed a comparison of 179
classifiers on 121 non-time series data sets originating from the UCI Machine Learning
Repository [9], concluding that RF is the most accurate one. Unfortunately, the usage
of RF is essentially limited to multidimensional data, as it samples features from
the original space while creating each node of the decision trees.

In this paper, we propose a method for extending RF to work with time series
using similarity forests (SF). We significantly extend the applicability of the RF
method to time series data. Furthermore, the approach even outperforms traditional
classifiers for time series. The main goal of this paper is to enrich the pool of time
series classifiers with Similarity Forest for time series classification. SF was initially
proposed by Sathe and Aggarwal in 2017 [19] as a method extending Random Forests
to deal with arbitrary data sets, provided that we are able to compute similarities
between observations. We implement the method and tune it to time series
data. We investigate the performance of the model using two distance measures (the
algorithm's hyper-parameter): Euclidean and DTW. We also compare its performance
against other selected time series classifiers:
1NN-ED, 1NN-DTW and RF.

The rest of the paper is structured as follows. In Section 2, we provide details 
of similarity forest and we give more details about random forests. Additionally, we 
discuss how similarity forest is related to random forest. Section 3 describes data 
sets that we used and the comparison methodology. The corresponding results are 
presented in Section 4. Finally, in Section 5 we give a brief summary of our research. 


2 Classification Methods Used in Comparison 


In the paper, we compare the standard random forest and the similarity forest with
two distance measures: ED (Euclidean distance) and DTW (dynamic time warping
distance). As benchmark methods, we also use the nearest neighbor method (1NN)
with the distance measures ED and DTW. 1NN-ED and 1NN-DTW are very common
classification methods for time series classification [2]. For a review of these methods
refer to [14].
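
To illustrate the difference between the two distance measures, a minimal sketch is given below. It uses the textbook dynamic-programming formulation of DTW with squared pointwise costs and no warping window; these choices, like the function names and the toy series, are assumptions and not necessarily the settings used in the experiments.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Dynamic time warping distance (squared pointwise cost, unconstrained warping)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

# The same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
ed_ab = euclidean(a, b)
dtw_ab = dtw(a, b)
```

For these two series the Euclidean distance penalizes the shift point by point, whereas DTW aligns the bumps and finds no difference in shape.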


2.1 General Method of Random Forest Construction 


Random forest consists of random decision trees. For the construction of a random 
forest we usually take decision trees as simple as possible — without special criteria 
for stopping, pruning, etc. 

When building a decision tree, we start at a node $N$, which contains the entire
data set (bootstrap sample). Then, according to an established criterion, we split the
node $N$ into two subnodes $N_1$ and $N_2$. Each subnode contains a subset of
the data set from node $N$. We make this split in a way that is optimal for a given
split method. In each node, we record how the split occurred. Then, proceeding
recursively, we split the next nodes into subnodes until the stop criterion is met. In our
case we take the simplest such criterion, namely we stop splitting a given node
when it contains only elements of the same class. We call such a node a
leaf and assign it the label shared by its elements.

Having built a tree, we can now use it (in the testing phase) to classify a new 
observation. We pass this observation through the trained tree, starting from the
node N and selecting each time one of the subnodes according to the condition stored
in the node. We do this until we reach one of the leaves, and then we assign the test 
observation to the class of the leaf. 

Now, to construct the random forest, we collect a certain number of decision
trees, train them independently according to the above method and, in the test phase,
use each of the trees to classify a new observation. Thus, each tree assigns a label to the
test observation. The final label (for the entire forest) is obtained by voting: we
choose the label that appears most frequently among the decision trees.
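
The voting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: each "tree" is assumed to be a plain function mapping an observation to a label, and the name `forest_predict` is hypothetical. Ties are resolved in favor of the label that was voted for first, which is an arbitrary choice.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the label predicted most often by the trees for observation x."""
    votes = [tree(x) for tree in trees]
    # most_common(1) yields the (label, count) pair with the highest count.
    return Counter(votes).most_common(1)[0][0]


# Three toy "trees" (hypothetical stand-ins for trained decision trees).
toy_trees = [
    lambda x: "a" if x[0] < 0.5 else "b",
    lambda x: "a" if x[1] < 0.2 else "b",
    lambda x: "b",
]
```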


2.2 Classical Random Forest 


To create a (classical) random tree and a random forest [4], we proceed as described 
above using the following node split method: 

To obtain split conditions for a single tree, we randomly select a certain number
of features ($\sqrt{k}$ for classification, where $k$ is the number of features), and for each feature
we create a feature vector (column, variable) made of all elements of the data set
(bootstrap sample). For a given feature vector (variable), we determine a threshold
vector. First, we sort the values of the feature vector (uniquely, without repeating
values). Let us name this sorted feature vector $V = (V_1, V_2, \ldots)$. Then we take the
split values as the means of successive values of the vector $V$:

$$v_i = \frac{V_i + V_{i+1}}{2}, \quad i = 1, 2, \ldots \qquad (1)$$

Each splitting value divides the data set in node N into two subsets: the first (left)
contains the elements with feature values smaller than $v_i$, and the second (right)
contains the other elements. Then we check the quality of such a split.

The splitting point is chosen such that it minimizes the Gini index of the children
nodes. If $p_1, p_2, \ldots, p_c$ are the fractions of data points belonging to the $c$ different
classes in node N, then the Gini index of that node is given by:

$$G(N) = 1 - \sum_{i=1}^{c} p_i^2.$$

Then, if the node N is split into two children nodes $N_1$ and $N_2$, with $n_1$ and $n_2$
points, respectively, the Gini quality of the children nodes is given by:

$$GQ(N_1, N_2) = \frac{n_1 G(N_1) + n_2 G(N_2)}{n_1 + n_2}.$$

The quality of the split is given by: $GQ(N) = G(N) - GQ(N_1, N_2)$.
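
The threshold construction and the Gini-based split quality can be sketched as below. This is a minimal sketch for a single feature vector, under the assumption that the best split is the one maximizing the quality $G(N) - GQ(N_1, N_2)$ (equivalent to minimizing the Gini quality of the children); the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def candidate_thresholds(values):
    """Means of successive unique sorted values, as in the threshold formula."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]


def best_split(values, labels):
    """Return (threshold, quality) with maximal G(N) - GQ(N1, N2), or None."""
    parent = gini(labels)
    best = None
    for thr in candidate_thresholds(values):
        left = [l for v, l in zip(values, labels) if v < thr]
        right = [l for v, l in zip(values, labels) if v >= thr]
        gq = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        quality = parent - gq
        if best is None or quality > best[1]:
            best = (thr, quality)
    return best
```

If all values of the feature are equal, there is no candidate threshold and the sketch returns `None`.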


2.3 Similarity Forest 


The similarity forest [19] differs from the ordinary (classical) random forest only in 
the way we split nodes of trees. Instead of selecting a certain number of features, 
we randomly select a pair of elements $e_1$, $e_2$ from different classes. Then, for each
element $e$ of the subset of elements in a given node, we calculate the difference of
the squared distances to the elements $e_1$ and $e_2$:

$$w(e) = d(e, e_1)^2 - d(e, e_2)^2,$$

where $d$ is any fixed distance measure on the elements of the data set. We sort the
vector $w$ uniquely (without duplicates), creating the vector $V$, and continue as for the
classical decision tree. We calculate the split values $v_i$ as in (1), calculate the quality
of the node split using the Gini index (Section 2.2) and choose the best split. In the learning
phase, we record in each node how the optimal split occurred (the elements $e_1$,
$e_2$ and the value of $w(e)$ at which the node is split).
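
A minimal sketch of the projection step of a similarity tree node is given below. It assumes the node contains elements of at least two classes (otherwise it would be a leaf), uses the Euclidean distance as a default stand-in for $d$, and the names `similarity_projection` and `euclidean` are hypothetical. The resulting values $w(e)$ would then be thresholded exactly like an ordinary feature vector.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def similarity_projection(data, labels, dist=euclidean, rng=random):
    """Pick e1, e2 from different classes and return them with w(e) for all e."""
    i = rng.randrange(len(data))
    # Candidates for e2: every element whose class differs from that of e1.
    others = [j for j in range(len(data)) if labels[j] != labels[i]]
    j = rng.choice(others)
    e1, e2 = data[i], data[j]
    w = [dist(e, e1) ** 2 - dist(e, e2) ** 2 for e in data]
    return e1, e2, w
```

Any other distance, for example DTW, can be passed through the `dist` argument without changing the rest of the split procedure.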


2.4 Random Forest vs Similarity Forest 


The difference between a classical random tree and a similarity tree is that instead of
selecting $\sqrt{k}$ of the features, we select only one pair of elements $e_1$, $e_2$. Generally,
we have far fewer possible node splits, which has a very good effect on the
computation time.

The second important difference is that in the similarity tree we use any distance 
measure between elements of the data set. Therefore, we can use distance measures 
specific to a data set. For example, for time series we can use the DTW distance, 
much better suited for calculating the distance between time series, instead of the 
Euclidean distance. 
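
For illustration, the textbook dynamic-programming form of DTW is sketched below. The paper does not state which DTW variant was used, so the squared pointwise cost, the absence of a warping window and the final square root are assumptions of this sketch.

```python
def dtw_distance(a, b):
    """DTW distance with squared pointwise cost and no warping window."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5
```

Two series that contain the same peak at shifted positions, such as `[0, 0, 1, 0]` and `[0, 1, 0, 0]`, have DTW distance zero in this sketch, whereas their Euclidean distance is positive. This is the property that makes DTW attractive for time series.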


3 Experimental Setup 


We investigated the performance of similarity forest on the UCR time series repository
[7] (128 data sets). The latest update of the UCR database introduced several data 
sets with missing observations and uneven sample lengths. However, the repository 
includes a standardized version of the database without these impediments, and that 
is the version we used. 

All data sets are split into a training and a testing subset, and all parameter
optimization is conducted on the training set only. We combined both parts and
then used 100 random train/test splits.


4 Results 


The error rates for each classifier can be found on the accompanying website¹. In
Table 1 we show a short summary of the results, including the number of wins (a draw
is not counted as a win) and mean ranks. Taking into account mean ranks, SF-DTW
is the best classifier, slightly ahead of RF (mean ranks equal to 2.64 and 2.69,
respectively). Figure 1 compares the error rates and ranks of the classifiers.
These results lead to the conclusion that, even though there is no clear winner, the top
positions are dominated by RF and SF-based classifiers. Figure 2 shows
scatter plots of errors for pairs of classifiers.


¹ https://github.com/ppias/similarity_forest_for_tsc

Table 1 Number of wins (clear wins) and mean ranks for the examined methods.


Method      1NN-ED   1NN-DTW   RF     SF-ED   SF-DTW

Wins        12       28        38     10      31
Mean rank   3.59     2.89      2.69   3.19    2.64


[Figure 1: panels listing the classifiers SF-ED, SF-DTW, RF, 1NN-ED and 1NN-DTW; axis label: Errors.]


Fig. 1 Comparison of error rates and ranks.


[Figure 2: scatter plot of error rates with regions labeled "SF-ED better here" and "SF-DTW better here".]


Fig. 2 Comparison of error rates.


To identify differences between the classifiers, we present a detailed statistical
comparison. First, we test the null hypothesis that all classifiers perform
the same and the observed differences are merely random. The Friedman test with
the Iman & Davenport extension is probably the most popular omnibus test, and it is
usually a good choice when comparing different classifiers [12]. The p-value from
this test is equal to 0 (up to numerical precision), which indicates that we can safely reject the
null hypothesis that all the algorithms perform the same. We can therefore proceed


with the post-hoc tests in order to detect significant pairwise differences among all 
of the classifiers. 

Demšar [8] proposes the use of Nemenyi's test [16], which compares all the
algorithms pairwise. For a significance level $\alpha$ the test determines the critical
difference (CD). If the difference between the average rankings of two algorithms is
greater than CD, the null hypothesis that the algorithms have the same performance
is rejected. Additionally, Demšar [8] proposes a plot to visually check the differences,
the CD plot. In the plot, those algorithms that are not joined by a line can be regarded
as different.

In our case, with a significance level of $\alpha = 0.05$, any two algorithms with a difference
in the mean rank above 0.54 will be regarded as different (Figure 3). We can see
that we have three groups of methods. In the first group we have SF-DTW, RF and
1NN-DTW, in the second we have RF, 1NN-DTW and SF-ED and in the last group
we have SF-ED and 1NN-ED. The groups are not disjoint. The first group
is the group with the highest accuracy of classification. Hence, SF-DTW does not
statistically outperform RF. However, we can recommend it over RF because it offers
statistically the same quality and much better computational properties.
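
The critical difference can be checked with a short calculation. The sketch below uses the standard Nemenyi formula $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ classifiers and $N$ data sets. The critical value 2.728 (for $k = 5$, $\alpha = 0.05$) is taken from the commonly tabulated values and is an assumption of this sketch, as the text does not state it; the mean ranks are those of Table 1.

```python
import math
from itertools import combinations


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# Assumption: tabulated Nemenyi critical value for k = 5 and alpha = 0.05.
Q_ALPHA = 2.728
mean_ranks = {"1NN-ED": 3.59, "1NN-DTW": 2.89, "RF": 2.69,
              "SF-ED": 3.19, "SF-DTW": 2.64}

cd = nemenyi_cd(Q_ALPHA, k=len(mean_ranks), n_datasets=128)

# Pairs of classifiers whose mean ranks differ by more than CD.
different = {frozenset(pair) for pair in combinations(mean_ranks, 2)
             if abs(mean_ranks[pair[0]] - mean_ranks[pair[1]]) > cd}
```

Under these assumptions the computed value rounds to 0.54, and the pairs exceeding it agree with the three overlapping groups described above.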


[Figure 3: critical difference (CD) diagram; legible labels: SF-DTW, 1NN-DTW.]


Fig. 3 Critical difference plot.


5 Conclusions 


Our contribution is an implementation of similarity forest for time series classification using
two distance measures: Euclidean and DTW. A comparison based on the recently
updated UCR data repository (128 data sets) was presented. We showed that SF-DTW
achieves the best mean rank among the examined classifiers, ahead of 1NN-DTW, which
has been considered a strong, hard-to-beat baseline for years. The statistical comparison
showed that the difference between RF and SF-DTW is not statistically significant;
however, taking into account mean ranks, the latter is the best one.

There are many improvements that could be applied to the implementation that 
we propose. For example, we could test other distance measures such as LCSS [21] 
or ERP [5], which have been used successfully in time series tasks. Another idea would be
to investigate the use of boosting algorithms.
Detection of the Biliary Atresia Using Deep
Convolutional Neural Networks Based on 
Statistical Learning Weights via Optimal 
Similarity and Resampling Methods 


Kuniyoshi Hayashi, Eri Hoshino, Mitsuyoshi Suzuki, Erika Nakanishi, 
Kotomi Sakai, and Masayuki Obatake 


Abstract Recently, artificial intelligence methods have been applied in several fields, 
and their usefulness is attracting attention. These methods are techniques that corre- 
spond to models using batch and online processes. Because of advances in compu- 
tational power, as represented by parallel computing, online techniques with several 
tuning parameters are widely accepted and demonstrate good results. Neural net- 
works are representative online models for prediction and discrimination. Many 
online methods require large training data to attain sufficient convergence. Thus, 
online models may not converge effectively for small and noisy training datasets. For
such cases, to realize effective learning convergence in online models, we introduce
statistical insights into an existing method to set the initial weights of deep
convolutional neural networks. Using an optimal similarity and resampling method, we
propose an initial weight configuration approach for neural networks. For a practical
example, identification of biliary atresia (a rare disease), we verified the usefulness
of the proposed method by comparing it with existing methods that also set initial weights
of neural networks.


Keywords: AUC, bootstrap method, leave-one-out cross-validation, projection ma- 
trix, rare disease, sensitivity and specificity 


1 Introduction 


The core technique in deep learning corresponds to neural networks, including the 
convolutional process. Since 2012, deep learning architectures have been frequently 
used for image classification [1, 2]. Moreover, deep convolutional neural networks
(DCNNs) are representative nonlinear classification methods for pattern recognition.
The DCNN technique is used as a powerful framework for the entirety of image 
processing [3]. The clinical medicine field presents many opportunities to perform 
diagnoses using imaging data from patients. Therefore, DCNN techniques are ap- 
plied to enhance diagnostic quality, e.g., applying a DCNN to a chest X-ray dataset 
to classify pneumonia [2] and detecting breast cancer [4]. However, DCNN architec- 
tures involve many parameters to be learned using training data. Therefore, effective 
and efficient model development must realize effective learning convergence for 
such parameters. Notably, it is important to set the initial parameter values to achieve 
better learning convergence. Furthermore, several methods have been proposed to 
set initial parameter values in the artificial intelligence (AI) field [5, 6]. However, 
there are no clear guidelines for determining which existing methods should be used 
in different situations. Thus, we propose an efficient initial weight approach using 
existing methods from the viewpoints of optimal similarity and resampling methods. 
Using a real-world clinical biliary atresia (BA) dataset, we evaluate the performance 
of the proposed method compared with existing DCNNs. Additionally, we show the 
usefulness of the proposed method in terms of learning convergence and prediction 
accuracy. 


2 Background 


BA is a rare disease that occurs in children and is fatal unless treated early. Previous 
studies have investigated models to identify BA by applying neural networks to pa- 
tient data [7] and using an ensemble deep learning model to detect BA [8]. However, 
these models were essentially for use in medical institutions, e.g., hospitals. Gener- 
ally, certain stool colors in infants and children are highly correlated with BA [9]. In 
Japan, the maternal and child health handbook includes a stool color card so parents 
can compare their child's stool color to the information on the card. Such fecal color 
cards are widely used to detect BA because of their easy accessibility outside the 
clinical environments. However, this stool color card screening approach for BA is 
subjective; thus, accurate and objective diagnoses are not always possible. Previously,
we developed a mobile application to classify BA and non-BA stools using
baby stool images captured using an iPhone [10]. Here, a batch type classification
method was used, i.e., the subspace method, originating from the pattern recognition
field. Since BA is a rare disease, the number of events in the case group is generally
small. Thus, when we set the explanatory variables of the target observation as the pixel
values of a target image, the number of explanatory variables is much higher than the
number of observations, especially in the disease group. With the subspace method, we
can efficiently discriminate such high-dimensional small-sample data. For example,
our previous study using the subspace method to classify BA and non-BA stools 
showed that BA could be discriminated with reasonable accuracy by applying the 
proposed method to image pixel data of the stool image data captured by a mobile 
phone [10]. This application was an automated version of the stool color card from 
the maternal and child health handbook. Unlike the models in previous studies [7, 8], the
application is widely available outside hospital environments. As described previously,
DCNNs are useful for image classification, including the automatic classification of
stool images for early BA detection.


3 Proposed Method 


Dimension reduction and discrimination processing can be realized using the sub- 
space method and DCNN techniques. In DCNN, layers based on padding, convo- 
lution, and pooling correspond to the dimension reduction functions, and the affine 
layer performs the discrimination. The primary motivation of this study is to propose 
a method that properly sets the initial weights of the parameters in a DCNN using 
statistical approaches. Our secondary motivation is to apply the proposed method to 
real-world, high-dimensional, and small-sample clinical data. 


3.1 Description of Related Procedures of the Convolution 


For image discrimination in pattern recognition and machine learning fields, the pixel 
values of the image data are set as the explanatory variables for the target outcome. 
Here, the data to be classified correspond to a high-dimensional observation. To 
improve efficiency and demonstrate the feasibility of discriminant processing, the 
dimensionality must be reduced to a manageable size before classification. The most 
representative dimensionality reduction method is convolution in pattern recognition 
and machine learning, which involves padding, convolution, and pooling operations. 
After converting the input image to a pixel data matrix, the pixel data matrix is 
surrounded with a numeric value of 0. Using a convolution filter, we reconstruct the 
pixel data matrix while considering pixel adjacency information. Generally, the size 
and convolution filter type are parameters that need optimization to realize sufficient 
prediction accuracy. However, some representative convolution filters that exhibit
good performance are known in the AI field, and we can essentially fix the size and 
type of the convolution filter. Finally, pooling is performed to reduce the size of the 
pixel data matrix after convolution. Here, we refer to the sequence of processing 
from padding to pooling as the layer for feature selection. 
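
The padding, convolution and pooling sequence can be sketched on a small pixel matrix as follows. This is a minimal pure-Python illustration, not the authors' R code: the filter is applied without flipping (cross-correlation, as is common in deep learning), pooling uses non-overlapping windows, and all function names are hypothetical.

```python
def zero_pad(m, pad=1):
    """Surround the matrix m with `pad` rows and columns of zeros."""
    width = len(m[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    body = [[0] * pad + list(row) + [0] * pad for row in m]
    bottom = [[0] * width for _ in range(pad)]
    return top + body + bottom


def convolve(m, kernel):
    """Slide the kernel over m and sum the elementwise products (no flipping)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(m[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(m[0]) - kw + 1)]
            for i in range(len(m) - kh + 1)]


def max_pool(m, size=2):
    """Maximum over non-overlapping size x size windows."""
    return [[max(m[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(m[0]) - size + 1, size)]
            for i in range(0, len(m) - size + 1, size)]


def feature_selection_layer(m, kernel, pad=1, size=2):
    """Padding, then convolution, then max pooling."""
    return max_pool(convolve(zero_pad(m, pad), kernel), size)
```

With a 3 x 3 filter and one ring of zero padding, the convolution preserves the matrix size, and 2 x 2 max pooling then halves each dimension.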


3.2 Setting Conditions Assumed in This Study 


We denote the input pattern matrices comprising numerical pixel values in hue (H),
saturation (S), and value (V) as $X^H (\in R^{p \times q})$, $X^S (\in R^{p \times q})$, and $X^V (\in R^{p \times q})$,
respectively. First, we performed padding for the input pattern matrices in H, S, and
V, and then performed a convolution in each signal pattern matrix using
a convolution filter. Next, we applied max pooling to each pattern matrix after
convolution. Here, we denote the pattern matrices after the padding, convolution,
and max pooling as $\tilde{X}^H (\in R^{p' \times q'})$, $\tilde{X}^S (\in R^{p' \times q'})$, and $\tilde{X}^V (\in R^{p' \times q'})$, respectively,
where $p'$ and $q'$ are less than $p$ and $q$. Then, we combine the component values of
each pattern matrix after padding, convolution, and max pooling into a single pattern
matrix by simply adding them together. The combined pattern matrix after applying
the feature selection layer is expressed as $\tilde{X} (\in R^{p' \times q'})$. Next, we applied convolution
and max pooling to the combined pattern matrix $k$ times. Additionally, the input
vector after performing the convolution and max pooling $k$ times is denoted by
$x (\in R^{r})$, and the output of the DCNN and the label vectors are denoted $y (\in R^{1 \times 1})$
and $t (\in R^{1 \times 1})$, respectively. In this study, we evaluated the difference between $y$
and $t$ according to the mean square error function, i.e., $L(y, t) = \frac{1}{n} \| t - y \|_2^2$.
Here, we consider a simple neural network with three layers. Concretely, between
the first and second layers, we perform a linear transformation using $W_1 (\in R^{2 \times r})$
and $b_1 (\in R^{2 \times 1})$. Then, a linear transformation is performed using $W_2 (\in R^{1 \times 2})$ and
$b_2 (\in R^{1 \times 1})$ between the second and third layers. Next, we defined $f_1(x)$ and $f_2 \circ f_1(x)$
as $W_1 x + b_1$ and $W_2 f_1(x) + b_2$, respectively. Note that we assume $\eta_2$ is a nonlinear
transformation between the second and third layers, and we calculated the output
$y$ as $\eta_2(f_2 \circ f_1(x))$. Generally, $y$ is calculated as a continuous value. For example,
with classification and regression tree methods, we can determine the optimal cutoff
point of $y$ from a prediction perspective.


3.3 General Approach to Update Parameters in CNNs 


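Here, we denote $f_1(x)$ and $f_2 \circ f_1(x)$ in the previous subsection as $u_1$ and $u_2$,
respectively. By performing the partial derivative of $L(y, t)$ with respect to $W_2$, we
obtain

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial W_2},$$

where $\frac{\partial L}{\partial y} = -\frac{2}{n}(t - y)$, $\frac{\partial y}{\partial u_2} = \eta_2'(u_2)$, and $\frac{\partial u_2}{\partial W_2} = u_1^T$.
Additionally, we calculate $\eta_2(u_2)$ as $1/(1 + \exp(-u_2))$ using the representative
sigmoid function, so that $\eta_2'(u_2)$ is calculated as $\eta_2(u_2)(1 - \eta_2(u_2))$. Therefore, we
obtain

$$\frac{\partial L}{\partial W_2} = -\frac{2}{n}(t - y)\, \eta_2(u_2)(1 - \eta_2(u_2))\, u_1^T.$$

With the learning coefficient of $\gamma_2$, we update $W_2$ to $W_2 - \gamma_2 \frac{\partial L}{\partial W_2}$.

Then, when performing the partial derivative of $L(y, t)$ with respect to $W_1$, we can obtain

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial W_1},$$

where $\frac{\partial L}{\partial y} = -\frac{2}{n}(t - y)$, $\frac{\partial y}{\partial u_2} = \eta_2(u_2)(1 - \eta_2(u_2))$, $\frac{\partial u_2}{\partial u_1} = W_2^T$, and
$\frac{\partial u_1}{\partial W_1} = x^T$. Thus, we then obtain

$$\frac{\partial L}{\partial W_1} = -\frac{2}{n}(t - y)\, \eta_2(u_2)(1 - \eta_2(u_2))\, W_2^T x^T.$$

With the learning coefficient of $\gamma_1$, we update $W_1$ to $W_1 - \gamma_1 \frac{\partial L}{\partial W_1}$.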
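
The chain-rule updates for the small three-layer network can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the hidden layer has two units, the output activation is the sigmoid, the constant in front of the squared error is kept as the hypothetical parameter `scale`, and only $W_1$ and $W_2$ are updated, as in the text.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(W1, b1, W2, b2, x):
    """Return u1 = W1 x + b1, u2 = W2 u1 + b2 and y = sigmoid(u2)."""
    u1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    u2 = sum(w * u for w, u in zip(W2, u1)) + b2
    return u1, u2, sigmoid(u2)


def gradients(W1, b1, W2, b2, x, t, scale=1.0):
    """Gradients of L = scale * (t - y)^2 with respect to W2 and W1."""
    u1, _, y = forward(W1, b1, W2, b2, x)
    # dL/dy * dy/du2, using the sigmoid derivative y * (1 - y).
    delta = -2.0 * scale * (t - y) * y * (1.0 - y)
    dW2 = [delta * u for u in u1]                     # delta * u1^T
    dW1 = [[delta * w * xi for xi in x] for w in W2]  # delta * W2^T x^T
    return dW2, dW1


def update(W1, b1, W2, b2, x, t, gamma1=0.1, gamma2=0.1):
    """One gradient step on W1 and W2 with learning coefficients gamma1, gamma2."""
    dW2, dW1 = gradients(W1, b1, W2, b2, x, t)
    new_W2 = [w - gamma2 * g for w, g in zip(W2, dW2)]
    new_W1 = [[w - gamma1 * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, dW1)]
    return new_W1, new_W2
```

A finite-difference comparison of the loss is a simple way to confirm that the analytic gradients agree with the derivation.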


3.4 Setting the Initial Weight Matrix in the Affine Layer 


To ensure proper learning convergence in situations with limited training datasets, we
proposed a method using optimal similarity and bootstrap methods. Here, the number
of training data and the training dataset are denoted $n$ and $S (\ni x_j)$, respectively, where
$x_j$ is the $j$-th training observation ($j$ takes values 1 to $n$). Additionally, we normalized
each observation vector, such that its norm is one. By considering the discrimination
problem of two groups whose outcomes are 0 and 1, respectively, we divided $\{x_j\}$
into $\{x_j \mid y_j = 0\}$ and $\{x_j \mid y_j = 1\}$. Next, we defined $\{x_j \mid y_j = 0\}$ and $\{x_j \mid y_j = 1\}$
as $S_0$ and $S_1$, respectively. First, we calculated the autocorrelation matrix with the
observations belonging to $S_0$. Then, using the eigenvalues ($\hat{\lambda}_{\ell_0}$) and eigenvectors
($\hat{u}_{\ell_0}$) for the autocorrelation matrix, we calculated the following projection matrix:

$$P_0 := \sum_{\ell_0 = 1}^{L_0} \hat{u}_{\ell_0} \hat{u}_{\ell_0}^T, \qquad (1)$$

where $\ell_0$ takes values 1 to $L_0$ in Equation (1). Similarly, we calculated the
autocorrelation matrix with the observations belonging to $S_1$. Then, with eigenvalues ($\hat{\lambda}_{\ell_1}$)
and eigenvectors ($\hat{u}_{\ell_1}$) for the autocorrelation matrix, we calculate the following
projection matrix:

$$P_1 := \sum_{\ell_1 = 1}^{L_1} \hat{u}_{\ell_1} \hat{u}_{\ell_1}^T, \qquad (2)$$

where $\ell_1$ takes values 1 to $L_1$ in Equation (2). Here, if the value of $x^T (P_1 - P_0) x > 0$,
we classify $x$ into $S_1$; otherwise, we classify $x$ into $S_0$.
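
The subspace classification rule can be sketched in pure Python as below. This is a simplified illustration, not the authors' R code: the leading eigenvectors are obtained by power iteration with deflation (any eigen-solver would do), the number of eigenvectors per class is passed in directly instead of being tuned by cross-validation, and all function names are hypothetical.

```python
def normalize(v):
    """Scale a vector to unit norm."""
    norm = sum(c * c for c in v) ** 0.5
    return [c / norm for c in v]


def autocorrelation(samples):
    """Average of the outer products s s^T over the samples."""
    d = len(samples[0])
    return [[sum(s[a] * s[b] for s in samples) / len(samples)
             for b in range(d)] for a in range(d)]


def top_eigenvectors(matrix, count, iters=500):
    """Leading eigenvectors of a symmetric PSD matrix (power iteration, deflation).

    Assumes `count` does not exceed the rank of the matrix.
    """
    d = len(matrix)
    m = [row[:] for row in matrix]
    vectors = []
    for k in range(count):
        v = normalize([1.0 / (i + 1 + k) for i in range(d)])  # fixed start vector
        for _ in range(iters):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            if sum(c * c for c in w) == 0.0:
                break
            v = normalize(w)
        lam = sum(v[a] * sum(m[a][b] * v[b] for b in range(d)) for a in range(d))
        vectors.append(v)
        # Deflation: remove the component that was just found.
        m = [[m[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return vectors


def projection_score(x, basis):
    """x^T P x for P equal to the sum of u u^T over the basis vectors."""
    return sum(sum(u_i * x_i for u_i, x_i in zip(u, x)) ** 2 for u in basis)


def classify(x, basis0, basis1):
    """Return 1 if x^T (P1 - P0) x > 0, otherwise 0."""
    return 1 if projection_score(x, basis1) - projection_score(x, basis0) > 0 else 0
```

For example, with one group concentrated near the first coordinate axis and the other near the second, a unit vector is assigned to the group whose leading direction it is closer to.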

From a prediction perspective, using the leave-one-out cross-validation [11],
we determined the optimal $L_0$ and $L_1$, which are the minimum values satisfying
$\tau \le \sum_{\ell_0 = 1}^{L_0} \hat{\lambda}_{\ell_0} / \sum_{\ell_0} \hat{\lambda}_{\ell_0}$ and $\tau \le \sum_{\ell_1 = 1}^{L_1} \hat{\lambda}_{\ell_1} / \sum_{\ell_1} \hat{\lambda}_{\ell_1}$, respectively,
where each denominator sums over all eigenvalues of the corresponding autocorrelation
matrix. Here, $\tau$ is
a tuning parameter to be optimized using the leave-one-out cross-validation. In the
second step, based on $P_1$, we estimated $\hat{y}_j$ as $x_j^T P_1 x_j$. In the third step, using existing
approaches [5, 6], we generated normal random numbers and set an
initial matrix, vector, and scalar as $W_2$, $b_1$, and $b_2$, respectively. Next, we extracted
$m$ observations randomly using the bootstrap method [12]. Using $W_2$, $b_1$, $b_2$, and a
bootstrap sample of size $m$, we estimated $W_2 W_1$ as follows:

$$W_2 W_1 = \{ \eta_2^{-1}(\hat{y}_j) - (W_2 b_1 + b_2) \}\, x_j^T (x_j x_j^T)^{-1}, \qquad (3)$$

where we estimate the inverse of $x_j x_j^T$ in Equation (3) using the naive approach from
the diagonal elements of $x_j x_j^T$. Additionally, using the generalized inverse approach,
we obtained $W_1$ on the basis of $W_2$ and $W_2 W_1$. Finally, $b_1$, $b_2$, $W_1$, and $W_2$ were
used as initial vectors and matrices to update the parameters of the convolutional
neural network.


4 Analysis Results on Real-world Data 


In this paper, all analyses were performed using R version 4.1.2 (R Foundation for 
Statistical Computing). We applied the proposed method to a real BA dataset. Here, 
stool image data with objects, such as diapers partially photographed on the image 
were used. In this numeric experiment, we randomly divided 35 data into 15 training
and 20 test data. Next, we compared the proposed and existing methods
relative to the learning convergence and prediction accuracy on the training and test
data, respectively. Here, we set the values of the learning coefficients $\gamma_1$ and $\gamma_2$ to
0.1. Also, we prepared a single feature selection layer and performed the
convolution and max pooling process seven times. Each time an initial value was set
randomly, learning was performed 1000 times using the 15 training data, and it was
judged that learning converged when the value obtained by dividing the sum of the
absolute values of the difference between $\hat{y}_j$ and $t_j$ by 1000 became less than 0.01.
We repeated the random division of the 35 data into 15 training and 20 test data five times.
As a result, we created five datasets. For each dataset, the sensitivity, specificity, and
AUC values of the training and test data were calculated using the parameters ($b_1$, $b_2$,
$W_1$, and $W_2$) at the time the learning first converged in the existing and our proposed
methods. Figure 1 shows the average of the five absolute values of the difference
between the correct label and the predicted value at each step when learning
first converged for each method. We can observe that the error decreased steadily as
learning with the proposed method progressed, compared to the existing methods. When the model
was constructed using the weights at the learning convergence point and applied to
the 15 training data every time, the average values of sensitivity and specificity were
100.0%, and that of the AUC value was 1.000 for all methods. However, a difference
was observed among the compared methods on the test data. For the method by [5],
the average values of sensitivity, specificity, and AUC in the test data were 83.3%,
42.5%, and 0.629, respectively. Also, for that of [6], the average values of sensitivity,
specificity, and AUC in the test data were 85.0%, 40.0%, and 0.625, respectively.
With the proposed method, the average values of sensitivity, specificity, and AUC
obtained on the test data were 85.0%, 67.5%, and 0.763, respectively.
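
For reference, the three reported measures can be computed as in the sketch below. This only illustrates the definitions and is not the authors' R code: labels are assumed to be 0/1 with 1 denoting the disease group, the hard predictions are assumed to come from some cutoff on the continuous output, and the AUC is computed as the probability that a positive case receives a higher score than a negative one (ties count one half).

```python
def sensitivity_specificity(labels, predictions):
    """Fraction of positives predicted 1 and fraction of negatives predicted 0."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    tn = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 0)
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```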
Fig. 1 Transition of learning in each method.


5 Conclusion and Limitations 


In this paper, we considered a discrimination problem using a DCNN for high- 
dimensional small sample data and proposed a method by setting the initial weight 
matrix in the affine layer. In situations of limited learning data, although transfer 
learning can be used, we proposed an efficient learning method using the DCNN 
method. In terms of learning convergence and results obtained from the test data,
we confirmed that the proposed method compared favorably with the existing methods. However, the results presented in this
paper are limited and the proposed method needs to be examined in more detail. 
Therefore, in the future, through large-scale simulation studies and other real-world 
data applications, we plan to investigate the differences between the proposed method 
and existing methods by changing the number of feature selection layers and using 
different convolution filters. We also plan to investigate the proposed method by 
considering robustness and setting outliers on the simulation data. 
Some Issues in Robust Clustering


Christian Hennig 


Abstract Some key issues in robust clustering are discussed with focus on the 
Gaussian mixture model based clustering, namely the formal definition of outliers, 
ambiguity between groups of outliers and clusters, the interaction between robust 
clustering and the estimation of the number of clusters, the essential dependence 
of (not only) robust clustering on tuning decisions, and shortcomings of existing 
measurements of cluster stability when it comes to outliers. 


Keywords: Gaussian mixture model, trimming, noise component, number of clus- 
ters, user tuning, cluster stability 


1 Introduction 


Cluster analysis is about finding groups in data. Robust statistics is about methods 
that are not affected strongly by deviations from the statistical model assumptions or 
moderate changes in a data set. Particular attention has been paid in the robustness 
literature to the effect of outliers. Outliers and other model deviations can have a 
strong effect on cluster analysis methods as well. There is now much work on robust 
cluster analysis, see [1, 19, 9] for overviews. 

There are standard techniques of assessing robustness such as the influence func- 
tion and the breakdown point [15] as well as simulations involving outliers, and these 
have been applied to robust clustering as well [19, 9]. 

Here I will argue that due to the nature of the cluster analysis problem, there are 
issues with the standard reasoning regarding robustness and outliers. 

The starting point will be clustering based on the Gaussian mixture model, for 
details see [3]. For this approach, $n$ observations are assumed i.i.d. with density
$f_\eta(x) = \sum_{k=1}^{K} \pi_k \varphi_{\mu_k, \Sigma_k}(x)$,
$x \in R^p$, with $K$ mixture components with proportions $\pi_k$, $\varphi_{\mu, \Sigma}$ being the Gaussian
density with mean vectors $\mu_k$, covariance matrices $\Sigma_k$, $k = 1, \ldots, K$, $\eta$ being a vector
of all parameters. For given $K$, $\eta$ can be estimated by maximum likelihood (ML)
using the EM-algorithm, as implemented for example in the R-package "mclust".
A standard approach to estimate $K$ is the optimisation of the Bayesian Information
Criterion (BIC). Normally, mixture components are interpreted as clusters, and
observations $x_i$, $i = 1, \ldots, n$, can be assigned to clusters using the estimated posterior
probability that $x_i$ was generated by mixture component $k$. A problem with ML
estimation is that the likelihood degenerates if all observations assigned to a mixture
component lie on a lower dimensional hyperplane, i.e., a $\Sigma_k$ has an eigenvalue of
zero. This can be avoided by placing constraints on the eigenvalues of the covariance
matrices [8]. Alternatively, a non-degenerate local optimum of the likelihood can
be used, and if this cannot be found, constrained covariance matrix models (such as
$\Sigma_1 = \ldots = \Sigma_K$) can be fitted instead, as is the default of mclust. Several issues with
robustness that occur here are also relevant for other clustering approaches.
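
The posterior assignment and the degeneracy of the likelihood can be illustrated in one dimension. The sketch below takes the mixture parameters as given (it does not run the EM-algorithm and is unrelated to the mclust implementation); the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, proportions, means, variances):
    """Posterior probabilities that x was generated by each mixture component."""
    weighted = [p * normal_pdf(x, m, v)
                for p, m, v in zip(proportions, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]


def assign(x, proportions, means, variances):
    """Index of the component with the largest posterior probability."""
    post = posteriors(x, proportions, means, variances)
    return max(range(len(post)), key=post.__getitem__)
```

If a component is centred exactly on a single observation and its variance shrinks towards zero, `normal_pdf` at that observation grows without bound. This is the one-dimensional analogue of the degeneracy described above, and it is what eigenvalue constraints rule out.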


2 Outliers vs Clusters 


It is well known that the sample mean and sample covariance matrix as estimators 
of the parameters of a single Gaussian distribution can be driven to breakdown by 
a single outlier [15]. Under a Gaussian mixture model with fixed $K$, an outlier must
be assigned to a mixture component $k$ and will break down the estimators of $\mu_k$, $\Sigma_k$
(which are weighted sample means and covariance matrices) for that component in
the same manner; the same holds for a cluster mean in k-means clustering.
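
The breakdown of the sample mean by a single outlier is easy to reproduce numerically. The toy data below are invented for illustration only; the median is shown merely as a familiar contrast and is not part of the argument in the text.

```python
import statistics


def mean_with_outlier(cluster, outlier):
    """Sample mean of the cluster after adding one outlying observation."""
    values = cluster + [outlier]
    return sum(values) / len(values)


toy_cluster = [0.9, 1.0, 1.1, 1.0]  # invented data around 1.0

# The mean follows the outlier without bound; the median stays at 1.0.
shifted_means = [mean_with_outlier(toy_cluster, o) for o in (10.0, 1e3, 1e6)]
medians = [statistics.median(toy_cluster + [o]) for o in (10.0, 1e3, 1e6)]
```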

Addressing this issue, and dealing with more outliers in order to achieve a high 
breakdown point, is a starting point for robust clustering. Central ideas are trimming 
a proportion of observations [7], adding a “noise component" with constant density 
to catch the outliers [4, 3], mixtures with more robust component-wise estimators 
such as mixtures of heavy-tailed distributions (Sec. 7 of [18]). 

But cluster analysis is essentially different from estimating a homogeneous popu- 
lation. Given a data set with K clear Gaussian clusters and standard ML-clustering, 
consider adding a single outlier that is far enough away from the clusters. Assuming 
a lower bound on covariance matrix eigenvalues, the outlier will form a one-point 
cluster, the mean of which will diverge with the added outlier, and the original 
clusters will be merged to form K — 1 clusters [10]. 

The same will happen with a group of several outliers being close together,
once more added far enough away from the Gaussian clusters. "Breakdown" of an
estimator is usually understood as the estimator becoming useless. It is questionable
that this is the case here. In fact, the "group of outliers" can well be interpreted as
a cluster in its own right, and putting all these points together in a cluster could be
seen as desirable behaviour of the ML estimator, at least if two of the original $K$
clusters are close enough to each other that merging them will produce a cluster that
is fairly well fitted by a single Gaussian distribution; note that the Gaussian mixture
model does not assume strong separation between components, and a mixture of
two Gaussians may be unimodal and in fact very similar to a single Gaussian. A
breakdown point larger than a given $\alpha$, $0 < \alpha < \frac{1}{2}$, may not be seen as desirable in
cluster analysis if there can be clusters containing a proportion of less than $\alpha$ of the
data, as a larger breakdown point will stop a method from taking such clusters (when
added in large distance from the rest of the data) appropriately into account.

The core problem is that it is not clear what distinguishes a group of outliers 
from a legitimate cluster. I am not aware of any formal definition of outliers and 
clusters in the literature that allows this distinction. Even a one-point cluster is not 
necessarily invalid. Here are some possible and potentially conflicting aspects of 
such a distinction. 


* A certain minimum size may be required for a cluster; smaller groups of points 
may be called outliers. 

* Groups of points in low density areas of the data may be called outliers. Note that 
this particularly means that very widely spread Gaussian mixture components 
would also be defined as outliers, deviating from the standard interpretation of 
Gaussian mixture components as clusters. 

* Members of non-Gaussian mixture components may be called outliers. This does 
not seem to be a good idea, because Gaussianity cannot be assessed for too small 
groups of observations, and furthermore in practice model assumptions are never 
perfectly fulfilled, and it may be desirable to interpret homogeneous or unimodal 
non-Gaussian parts of the data as “cluster” and fit them by a Gaussian component. 

* The term “outlier” suggests that outliers lie far away from most other observa- 
tions, so it may be required that outliers are farther away from the clusters than 
the clusters are from each other. But this would be in conflict with the intuition 
that strong separation is usually seen as a desirable feature for well interpretable 
clusters. It may only be reasonable in applications in which there is prior informa- 
tion that there is limited variation even between clusters, as is implied by certain 
Bayesian approaches to clustering [17]. 

* The term “cluster” may be seen as flexible enough that a definition of an outlier 
is not required. Clustering should accommodate whatever is “outlying” by fitting 
it by one or more further clusters, if necessary of size one (single linkage clus- 
tering can be useful for outlier detection, even though it is inappropriate for most 
clustering problems). 


Most of these items require specific decisions that cannot be made in any objective 
and general manner, but only taking into account subject matter information, such 
as the minimum size of valid clusters or the density level below which observations 
are seen as outliers (potentially compared to density peaks in the distribution). This 
implies that an appropriate treatment of outliers in cluster analysis cannot be expected 
to be possible without user tuning. 
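
As an illustration of the first aspect, a minimum cluster size can be applied as an explicit tuning parameter after any clustering. This is only a sketch: the label `NOISE` and the helper `relabel_small_clusters` are hypothetical names, and the text does not prescribe this post-processing step.

```python
from collections import Counter

NOISE = -1  # hypothetical label for the outlier/noise set

def relabel_small_clusters(labels, min_size):
    """Relabel members of clusters with fewer than min_size points as NOISE.

    min_size is the user tuning decision: the smallest group of points
    that is still accepted as a valid cluster.
    """
    sizes = Counter(l for l in labels if l != NOISE)
    return [l if l != NOISE and sizes[l] >= min_size else NOISE
            for l in labels]

labels = [0, 0, 0, 0, 1, 1, 1, 2, 3, 3]
strict = relabel_small_clusters(labels, min_size=3)
lenient = relabel_small_clusters(labels, min_size=1)
```

With `min_size=3` the groups of sizes one and two become noise, whereas `min_size=1` accepts even a one-point cluster; the data alone do not say which choice is right.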
3 Robustness and the Number of Clusters 


The last item suggests that there is an interplay between outlier identification and the 
number of clusters, and that adding clusters might be a way of dealing with outliers; 
as long as clusters are assumed to be Gaussian, a single additional component may 
not be enough. More generally, concentrating robustness research on the case of 
fixed K may be seen as unrealistic, because K is rarely known, although estimating 
K is a notoriously difficult problem even without worrying about outliers [13]. 

The classical robustness concepts, breakdown point and influence function, assume 
parameters from R^q with fixed q. If K is not fixed, the number of parameters 
is not fixed either, and the classical concepts do not apply. 

As an alternative to the breakdown point, [11] defined a "dissolution point". 
Dissolution is measured in terms of cluster memberships of points rather than in 
terms of parameters, and is therefore also applicable to nonparametric clustering 
methods. Furthermore, dissolution applies to individual clusters in a clustering; 
certain clusters may dissolve, i.e., there may be no sufficiently similar cluster in a 
new clustering computed after, e.g., adding an outlier; and others may not dissolve. 
This does not require K to be fixed; the definition is chosen so that if a clustering 
changes from K to L < K clusters, at least K - L clusters dissolve. 
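
A membership-based dissolution check can be sketched as follows. The sketch assumes that similarity between clusters is measured by the Jaccard similarity on the original points and that a cluster counts as dissolved if its similarity to every cluster of the new clustering is at most 1/2; this threshold is my reading of [11] and should be treated as an assumption. The function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty sets of point indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dissolved_clusters(old, new, threshold=0.5):
    """Indices of clusters in `old` without a sufficiently similar cluster in `new`.

    `new` may contain added points (e.g. outliers); it is restricted to the
    points of `old` before the comparison.
    """
    points = set().union(*old)
    restricted = [set(c) & points for c in new]
    restricted = [c for c in restricted if c]  # drop clusters of added points only
    return [k for k, c in enumerate(old)
            if max(jaccard(c, d) for d in restricted) <= threshold]

# Three original clusters; after adding outlier 12 with fixed K = 3, the
# outlier forms its own cluster and the first two clusters are merged.
old = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]
new = [{0, 1, 2, 3, 4, 5, 6, 7}, {8, 9, 10, 11}, {12}]
dissolved = dissolved_clusters(old, new)  # -> [0, 1]
```

On the original points the clustering goes from K = 3 to L = 2 clusters, and both merged clusters dissolve, in line with the bound of at least K - L dissolved clusters.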

Hennig [10, 11] showed that when estimating K using the BIC and standard ML 
estimation, reasonably well separated clusters do not dissolve when adding possibly 
even a large percentage of outliers (this does not hold for every method to estimate 
the number of clusters, see [11]). Furthermore, [11] showed that no method with 
fixed K can be robust for data in which K is misspecified - already [7] had found 
that robustness features in clustering generally depend on the data. 

An implication of these results is that even in the fixed K problem, the standard 
ML method can be a valid competitor regarding robustness if it comes with a rule 
that allows adding one or possibly more clusters that can then be used to fit the 
outliers (this is rarely explored in the literature, but [18], Sec. 7.7, show an example 
in which adding a single component does not work very well). 

An issue with adding clusters to accommodate outliers is that in many applications 
it is appropriate to distinguish between meaningful clusters, and observations that 
cannot be assigned to such clusters (often referred to as "noise"). Even though adding 
clusters of outliers can formally prevent the dissolution of existing clusters, it may 
be misleading to interpret the resulting clusters as meaningful, and a classification 
as outliers or noise can be more useful. This is provided by the trimming and noise 
component approaches to robust clustering. Also some other clustering methods such 
as the density-based DBSCAN [5] provide such a distinction. On the other hand, 
modelling clusters by heavy-tailed distributions such as in mixtures of t-distributions 
will implicitly assign outlying observations to clusters that potentially are quite far 
away. For this reason, [18], Sec. 7.7, provide an additional outlier identification 
rule on top of the mixture fit. [6] even distinguish between “mild” outliers that are 
modelled as having a larger variance around the same mean, and "gross" outliers to 
be trimmed. The variety of approaches can be connected to the different meanings 
that outliers can have in applications. They can be erroneous, they can be irrelevant 
noise, but they can also be caused by unobserved but relevant special conditions (and 
would as such qualify as meaningful clusters), or they could be valid observations 
legitimately belonging to a meaningful cluster that regularly produces observations 
further away from the centre than modelled by a Gaussian distribution. 

Even though currently there is no formal robustness property that requires both the 
estimation of K and an identification or downweighting of outliers, there is demand 
for a method that can do both. 

Estimating K comes with an additional difficulty that is relevant in connection 
with robustness. As mentioned before, in clustering based on the Gaussian mixture 
model normally every mixture component will be interpreted as a cluster. In reality, 
however, meaningful clusters are not perfectly Gaussian. Gaussian mixtures are very 
flexible for approximating non-Gaussian distributions. Using a consistent method 
for estimating K means that for large enough n a non-Gaussian cluster will be 
approximated by several Gaussian mixture components. The estimated K will be 
fine for producing a Gaussian mixture density that fits the data well, but it will 
overestimate the number of interpretable clusters. The estimation of K, if interpreted 
as the number of clusters, relies on precise Gaussianity of the clusters, and is as such 
itself riddled with a robustness problem; in fact slightly non-Gaussian clusters may 
even drive the estimated K → ∞ as n → ∞ [12, 14]. 

This is connected with the more fundamental problem that there is no unique 
definition of a cluster either. The cluster analysis user needs to specify the cluster 
concept of interest even before robustness considerations, and arguably different 
clustering methods imply different cluster concepts [13]. A Gaussian mixture model 
defines clusters by the Gaussian distributional shape (unless mixture components 
are merged to form clusters [12]). Although this can be motivated in some real situ- 
ations, robustness considerations require that distributional shapes fairly close to the 
Gaussian should be accepted as clusters as well, but this requires another specifica- 
tion, namely how far from a Gaussian a cluster is allowed to be, or alternatively how 
separated Gaussian components have to be in order to count as separated clusters. A 
similar problem can also occur in nonparametric clustering; if clusters are associated 
with density modes or level sets, the cluster concept depends on how weak a mode 
or gap between high level density sets is allowed to be to be treated as meaningful. 

Hennig and Coretto [14] propose a parametric bootstrap approach to simultane- 
ously estimate K and assign outliers to a noise component. This requires two basic 
tuning decisions. The first one regards the minimum percentage of observations so 
that a researcher is willing to add another cluster if the noise component can be re- 
duced by this amount. The second one specifies a tolerance that allows a data subset 
to count as a cluster even though it deviates to some extent from what is expected 
under a perfectly Gaussian distribution. There is a third tuning parameter that is in 
effect for fixed K and tunes how much of the tails of a non-Gaussian cluster can be 
assigned to the noise in order to improve the Gaussian appearance of the cluster. One 
could even see the required constraints on covariance matrix eigenvalues as a further 
tuning decision. Default values can be provided, but situations in which matters can 
be improved deviating from default values are easy to construct. 
4 More on User Tuning 


User tuning is not popular, as it is often difficult to make appropriate tuning decisions. 
Many scientists believe that subjective user decisions threaten scientific objectivity, 
and choices that depend on background knowledge cannot be made when investigating 
a method's performance by theory and simulations. The reason why user tuning 
is indispensable in robust cluster analysis is that it is required in order to make the 
problem well defined. The distinction between clusters and outliers is an interpre- 
tative one that no automatic method can make based on the data alone. Regarding 
the number of clusters, imagine two well separated clusters (according to whatever 
cluster concept of interest), and then imagine them to be moved closer and closer 
together. Below what distance are they to be considered a single cluster? This is 
essentially a tuning decision that the data cannot make on their own. 

There are methods that do not require user tuning. Consider the mclust imple- 
mentation of Gaussian mixture model based clustering. The number of clusters is by 
default estimated by the BIC. As seen above, this is not really appropriate for large 
data sets, but its derivation is essentially asymptotic, so that there is no theoretical 
justification for it for small data sets either. Empirically it often but not always works 
well, and there is little investigation of whether it tends to make the "right" decision 
in ambiguous situations where it is not clear without user tuning what it even means 
to be "right". Covariance matrix constraints in mclust are not governed by a tuning of 
eigenvalues or their ratios to be specified by the user. Rather the BIC decides between 
different covariance matrix models, but this can be erratic and unstable, as it depends 
on whether the EM-algorithm gets caught in a degenerate likelihood maximum or 
not, and in situations where two or more covariance matrix models have similar BIC 
values (which happens quite often), a tiny change in the data can result in a different 
covariance matrix model being selected, and substantial changes in the clustering. A 
tunable eigenvalue condition can result in much smoother behaviour. When it comes 
to outlier identification, mclust offers the addition of a uniform “noise” mixture 
component governed by the range of the data, again supposedly without user tuning. 
This starts from an initial noise estimation that requires tuning (Sec. 3.1.2 of [3]) and 
is less robust in terms of breakdown and dissolution than trimming and the improper 
noise component, both of which require tuning [10, 11]. The ICL, an alternative to 
the BIC (Sec. 2.6 of [3]), on the other hand, is known to merge different Gaussian 
mixture components already at a distance at which they intuitively still seem to 
be separated clusters. Similar comments apply to the mixture of t-distributions; it 
requires user tuning for identifying outliers, scatter matrix constraints, and it has the 
same issues with BIC and ICL as the Gaussian mixture. 

Summarising, both the identification of and robustness against outliers and the 
estimation of the number of clusters require tuning in order to be well defined 
problems; user tuning can only be avoided by taking tuning decisions out of the 
user’s hands and making them internally, which will work in some situations and 
fail in others, and the impression of automatic data driven decision making that a 
user may have is rather an illusion. This, however, does not free method designers 
from the necessity to provide default tunings for experimentation and cases in which 
the users do not feel able to make the decisions themselves, and tuning guidance for 
situations in which more information is available. A decision regarding the smallest 
valid size of a cluster is rather well interpretable; a decision regarding admissible 
covariance matrix eigenvalues is rather difficult and abstract. 


5 Stability Measurement 


Robustness is closely connected to stability. Both experimental and theoretical inves- 
tigation of the stability of clusterings require formal stability measurements, usually 
comparing two clusterings on the same data (potentially modified by replacing or 
adding observations). Not assuming any parametric model, proximity measures such 
as the Adjusted Rand Index (ARI; [16]), the Hamming distance (HD; [2]), or the 
Jaccard distance between individual clusters [11] can be used. Note that [2], a standard 
reference on cluster stability in the machine learning community, state that stability 
and instability are caused in the first place by ambiguities in the cluster structure 
of the data, rather than by a method's robustness or lack of it. Although the outlier 
problem is ignored in that paper, it is true that cluster analysis can have other stability 
issues that are as serious as or worse than gross outliers. 

To my knowledge, none of the measures currently in use allow for a special 
treatment of a set of outliers or noise; either these have to be ignored, or treated just 
as any other cluster. Both ARI and HD, comparing clusterings $C_1$ and $C_2$, consider 
pairs of observations $x_i, x_j$ and check whether those that are in the same cluster 
in $C_1$ are also in the same cluster in $C_2$. An appropriate treatment of noise sets 
$N_1 \in C_1$, $N_2 \in C_2$ would require that $x_i, x_j \in N_1$ are not just in the same cluster in 
$C_2$ but rather in $N_2$, i.e., whereas the numberings of the regular clusters do not have 
to be matched (which is appropriate because cluster numbering is meaningless), $N_1$ 
has to be matched to $N_2$. Corresponding re-definitions of these proximities will be 
useful to robustness studies. 
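
The blindness of pair-counting measures to the role of the noise set can be seen in a small example. The sketch below implements the ARI from the contingency table and, purely as an illustrative diagnostic that is not proposed in the text, the Jaccard similarity between the two noise sets; the label -1 for noise and both function names are assumptions.

```python
from collections import Counter
from math import comb

NOISE = -1  # hypothetical label for the noise set

def adjusted_rand_index(a, b):
    """ARI of two label vectors (undefined if both are trivial partitions)."""
    n = len(a)
    sum_ab = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ab - expected) / (max_index - expected)

def noise_jaccard(a, b):
    """Jaccard similarity between the noise sets of two label vectors."""
    n1 = {i for i, l in enumerate(a) if l == NOISE}
    n2 = {i for i, l in enumerate(b) if l == NOISE}
    return len(n1 & n2) / len(n1 | n2) if n1 | n2 else 1.0

# Same partition, but the set called noise in c1 is a regular cluster in c2.
c1 = [0, 0, 0, 1, 1, 1, NOISE, NOISE]
c2 = [NOISE, NOISE, NOISE, 1, 1, 1, 0, 0]
ari = adjusted_rand_index(c1, c2)   # -> 1.0
noise_sim = noise_jaccard(c1, c2)   # -> 0.0
```

The ARI reports perfect agreement because the label -1 is treated as just another cluster, while the noise sets do not overlap at all.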


6 Conclusion 


Key practical implications of the above discussions are: 


* Outliers can be treated as forming their own clusters, or be collected in out- 
lier/noise or trimmed sets, or be integrated in clusters of non-outliers. Which of 
these is appropriate depends on the nature of outliers in a given application. 

* Methods that do not identify outliers but add clusters in order to accommodate 
them are valid competitors of robust clustering methods, as are nonparametric 
density-based methods. 

* Cluster analysis that involves estimating the number of clusters and robustness requires 
tuning in order to define well the problem it is meant to solve. Method 
developers need to provide sensible defaults, but also to guide the users regarding 
a meaningful interpretation of the tuning decisions. 
Robustness Aspects of Optimized Centroids 


Jan Kalina and Patrik Janácek 


Abstract Centroids are often used for object localization tasks, supervised seg- 
mentation in medical image analysis, or classification in other specific tasks. This 
paper starts by contributing to the theory of centroids by evaluating the effect of 
modified illumination on the weighted correlation coefficient. Further, robustness 
of various centroid-based tools is investigated in experiments related to mouth lo- 
calization in non-standardized facial images or classification of high-dimensional 
data in a matched pairs design. The most robust results are obtained if the sparse 
centroid-based method for supervised learning is accompanied by an intrinsic variable 
selection. Robustness, sparsity, and energy-efficient computation turn out not to 
contradict the requirement on the optimal performance of the centroids. 


Keywords: image processing, optimized centroids, robustness, sparsity, low-energy 
replacements 


1 Introduction 


Methods based on centroids (templates, prototypes) are simple yet widely used for 
object localization or supervised segmentation in image analysis tasks and also within 
other supervised or unsupervised methods of machine learning. This is true e.g. in 
various biomedical imaging tasks [1], where researchers typically cannot afford a too 
large number of available images [3]. Biomedical applications also benefit from the 
interpretability (comprehensibility) of centroids [11]. 

This paper focuses on the question of how centroid-based methods are influenced 
by data contamination. Section 2 recalls the main approaches to centroid-based 
object localization in images, as well as a recently proposed method of [6] for 
optimizing centroids and their weights. The performance of these methods under data 
contamination (non-standard conditions) has, however, not been sufficiently 
investigated. In particular, we are interested in the performance of low-energy replacements 
of the optimal centroids and in the effect of posterior variable selection (pixel selec- 
tion). Section 2.1 presents novel expressions for images with a changed illumination. 
Numerical experiments are presented in Section 3. These are devoted to mouth lo- 
calization over raw facial images as well as over artificially modified images; other 
experiments are devoted to high-dimensional data in a matched pairs design. The 
optimized centroids of [6] and especially their modification proposed here turn out 
to have remarkable robustness properties. Section 4 brings conclusions. 


2 Centroid-based Classification (Object Localization) 


Commonly used centroid-based approaches to object localization (template match- 
ing) in images construct the centroid simply as the average of the positive examples 
and typically use Pearson product-moment correlation coefficient r as the most com- 
mon measure of similarity between a centroid c and a candidate part of the image 
(say x). While the centroid and candidate areas are matrices of size (say) $I \times J$ pixels, 
they are used in computations after being transformed to vectors of length d := IJ. 
This allows us to use the notation $c = (c_1, \dots, c_d)^T$ and $x = (x_1, \dots, x_d)^T$. 

Assumptions A: We assume the whole image to have size $N_R \times N_C$ pixels. We 
assume the centroid $c = (c_{ij})_{i,j}$ with $i = 1, \dots, I$ and $j = 1, \dots, J$ to be a matrix of 
size $I \times J$ pixels. A candidate area x and nonnegative weights w with $\sum_i \sum_j w_{ij} = 1$ 
are assumed to be matrices of the same size as c. 

For a given image, E will denote the set of its rectangular candidate areas of size 
$I \times J$. The candidate area fulfilling 

$$\arg\max_{x \in E} r(x, c) \tag{1}$$

or (less frequently) 

$$\arg\min_{x \in E} \|x - c\|_2 \tag{2}$$

is classified to correspond to the object (e.g. mouth). 
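
The rule (1) can be sketched as a plain sliding-window search. This is a minimal illustration on a toy image, not the implementation used in the experiments; the function names, the toy data, and the convention of returning 0 for constant areas are assumptions.

```python
import math

def pearson(x, c):
    """Pearson correlation coefficient r between two equally long sequences."""
    n = len(x)
    mx, mc = sum(x) / n, sum(c) / n
    sxc = sum((a - mx) * (b - mc) for a, b in zip(x, c))
    sxx = sum((a - mx) ** 2 for a in x)
    scc = sum((b - mc) ** 2 for b in c)
    if sxx == 0.0 or scc == 0.0:
        return 0.0  # convention chosen here for constant areas (assumption)
    return sxc / math.sqrt(sxx * scc)

def localize(image, centroid):
    """Top-left corner and score of the I x J candidate area maximizing r."""
    I, J = len(centroid), len(centroid[0])
    c = [v for row in centroid for v in row]  # vectorize the centroid
    best, best_pos = -2.0, None
    for r in range(len(image) - I + 1):
        for s in range(len(image[0]) - J + 1):
            x = [image[r + i][s + j] for i in range(I) for j in range(J)]
            score = pearson(x, c)
            if score > best:
                best, best_pos = score, (r, s)
    return best_pos, best

# Toy image: patterned background with a brighter, rescaled copy of the
# centroid placed at row 2, column 3.
centroid = [[1, 2], [3, 5]]
image = [[(2 * r + 3 * s) % 5 for s in range(6)] for r in range(5)]
for i in range(2):
    for j in range(2):
        image[2 + i][3 + j] = 2 * centroid[i][j] + 10

pos, score = localize(image, centroid)  # pos -> (2, 3)
```

The object is found although its grey values were shifted and rescaled as a whole, because r is invariant to such global changes of a candidate area.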
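Let us consider here replacing r by the weighted correlation coefficient $r_w$ 

$$\arg\max_{x \in E} r_w(x, c; w) \tag{3}$$

with given non-negative weights $w = (w_1, \dots, w_d)^T \in \mathbb{R}^d$ with $\sum_{i=1}^d w_i = 1$, 
where $\mathbb{R}$ denotes the set of all real numbers. Let us further use the notation 
$\bar{x}_w = \sum_{j=1}^d w_j x_j = w^T x$ and $\bar{c}_w = w^T c$. We may recall $r_w$ between x and c to be defined 
as 

$$r_w(x, c; w) = \frac{\sum_{i=1}^d w_i (x_i - \bar{x}_w)(c_i - \bar{c}_w)}{\sqrt{\sum_{i=1}^d \left[w_i (x_i - \bar{x}_w)^2\right] \sum_{i=1}^d \left[w_i (c_i - \bar{c}_w)^2\right]}}. \tag{4}$$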
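
The weighted correlation coefficient can be computed directly from its definition. The following sketch assumes that x, c, and w are already vectorized and that the weights are non-negative and sum to one; the function name and the toy values are hypothetical.

```python
import math

def weighted_corr(x, c, w):
    """Weighted correlation coefficient r_w of x and c with weights w."""
    xw = sum(wi * xi for wi, xi in zip(w, x))  # weighted mean of x
    cw = sum(wi * ci for wi, ci in zip(w, c))  # weighted mean of c
    sxc = sum(wi * (xi - xw) * (ci - cw) for wi, xi, ci in zip(w, x, c))
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    scc = sum(wi * (ci - cw) ** 2 for wi, ci in zip(w, c))
    return sxc / math.sqrt(sxx * scc)

c = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0, 100.0]               # last pixel is contaminated
equal = [0.25, 0.25, 0.25, 0.25]
sparse = [1 / 3, 1 / 3, 1 / 3, 0.0]      # zero weight on the last pixel

r_equal = weighted_corr(x, c, equal)     # equals the ordinary Pearson r
r_sparse = weighted_corr(x, c, sparse)   # contaminated pixel is ignored
```

With equal weights the contaminated pixel lowers the similarity noticeably, while a zero weight on that pixel restores a perfect correlation; this is the mechanism by which sparse weights can protect a centroid.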
Fig. 1 The workflow of the optimization procedure of [6]. Panels: initial centroid with initial weights; optimal centroid with initial weights; optimal centroid with optimal weights. 


A detailed study of [2] investigated theoretical foundations of centroid-based classi- 
fication, however for the rare situation when (1) is replaced by 

The sophisticated centroid optimization method of [6], outlined in Figure 1, 
requires minimizing a nonlinear loss function corresponding to a regularized margin-like 
distance (exploiting $r_w$) evaluated for the worst pair from the worst image over 
the training database (i.e. the worst with respect to the loss function). Subsequently, 
the weights may also be optimized, so that many pixels obtain 
zero weights (i.e. yielding a sparse solution). The optimal centroid may be used 
as such, even without any weights at all; still, optimization of the weights leads 
to a further improvement of the classification performance. In the current paper, 
we always consider a linear (i.e. approximate) approach to centroid optimization, 
although a nonlinear optimization is also successful as revealed in the comparisons 
in [6]. 


2.1 Centroid-Based Object Localization: Asymmetric Modification 
of the Candidate Area 


In the context of object localization as described above, our aim is to express 
$r_w(x^*, c; w)$ under modified candidate areas (say $x^*$) of the image x; we stress that 
the considered modification of the image does not affect the centroid c or the 
weights w. These considerations are useful for centroid-based object localization 
when asymmetric illumination is present in the whole image or its part. The weighted 
variance $S_w^2(x; w)$ of x with weights w and the weighted covariance $S_w(x, c)$ between 
x and c are denoted as 

$$S_w^2(x) = \sum_{i,j} w_{ij} (x_{ij} - \bar{x}_w)^2, \qquad S_w(x, c) = \sum_{i,j} w_{ij} (x_{ij} - \bar{x}_w)(c_{ij} - \bar{c}_w). \tag{5}$$

Further, the notation x + a with $x = (x_{ij})_{i,j}$ is used to denote the matrix $(x_{ij} + a)_{i,j}$ 
for a given $a \in \mathbb{R}$. We also use the following notation. The image x is divided into two 
parts $x = (x_1, x_2)^T \in \mathbb{R}^d$, where $\sum_I$ or $\sum_{II}$ denote the sum over the pixels of the 
first or second part, respectively. 


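Theorem 1 Under Assumptions A, the following statements hold. 

1. For $x^* = x + \varepsilon$, it holds $r_w(x^*, c) = r_w(x, c)$ for $\varepsilon > 0$. 
2. For $x^* = kx$ with $k > 0$, it holds $r_w(x^*, c) = r_w(x, c)$. 
3. For $x = (x_1, x_2)^T$ and $x^* = (x_1, x_2 + \varepsilon)^T$, it holds 

$$r_w(x^*, c) = \frac{S_w(x, c) + \varepsilon \sum_{II} w_{ij} c_{ij} - \varepsilon v_2 \bar{c}_w}{S_w(c) \sqrt{S_w^2(x) + v_2 (1 - v_2) \varepsilon^2 + 2 \varepsilon \left(\sum_{II} w_{ij} x_{ij} - v_2 \bar{x}_w\right)}}, \tag{6}$$

where $v_2 = \sum_{II} w_{ij}$ and $\varepsilon \in \mathbb{R}$. 
4. For $x = (x_1, x_2)^T$ and $x^* = (x_1, k x_2)^T$ with $k > 0$, it holds 

$$r_w(x^*, c) = \frac{S_w(x)}{S_w^*(x)} \, r_w(x, c) + \frac{(k - 1) \sum_{II} w_{ij} x_{ij} (c_{ij} - \bar{c}_w)}{S_w^*(x) \, S_w(c)}, \tag{7}$$

where 

$$\left(S_w^*(x)\right)^2 = S_w^2(x) + (k^2 - 1) \left[\sum_{II} w_{ij} x_{ij}^2 - \Big(\sum_{II} w_{ij} x_{ij}\Big)^2\right] - 2 (k - 1) \sum_{I} w_{ij} x_{ij} \sum_{II} w_{ij} x_{ij}. \tag{8}$$
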

The proofs of the formulas are technical but straightforward, exploiting known 
properties of $r_w$. The theorem reveals $r_w$ to be vulnerable to modified illumination, 
i.e. all the methods based on centroids of Section 2 may be strongly influenced by 
the data modification. 
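
The statements of Theorem 1 can be checked numerically on toy data. The closed-form expressions used in the sketch below were derived directly from the definitions of the weighted mean, variance, and covariance; they are therefore an assumption about the intended form of the theorem's formulas. All names and numbers are hypothetical.

```python
import math

def wmean(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def wcov(w, x, c):
    """Weighted covariance; wcov(w, x, x) is the weighted variance."""
    xw, cw = wmean(w, x), wmean(w, c)
    return sum(wi * (xi - xw) * (ci - cw) for wi, xi, ci in zip(w, x, c))

def wcorr(w, x, c):
    return wcov(w, x, c) / math.sqrt(wcov(w, x, x) * wcov(w, c, c))

w = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1]   # weights summing to one
x = [0.2, 0.5, 0.4, 0.9, 0.1, 0.7]   # candidate area (vectorized)
c = [0.3, 0.4, 0.6, 0.8, 0.2, 0.5]   # centroid (vectorized)
part1, part2 = [0, 1, 2], [3, 4, 5]  # pixels of part I and part II
eps, k = 0.25, 1.7

r = wcorr(w, x, c)
r_shift = wcorr(w, [xi + eps for xi in x], c)   # statement 1
r_scale = wcorr(w, [k * xi for xi in x], c)     # statement 2

# Statement 3: only part II becomes brighter by eps.
x_star = [xi + eps if i in part2 else xi for i, xi in enumerate(x)]
r_partial = wcorr(w, x_star, c)
v2 = sum(w[i] for i in part2)
s2_wc = sum(w[i] * c[i] for i in part2)
s2_wx = sum(w[i] * x[i] for i in part2)
num = wcov(w, x, c) + eps * s2_wc - eps * v2 * wmean(w, c)
var = (wcov(w, x, x) + v2 * (1 - v2) * eps ** 2
       + 2 * eps * (s2_wx - v2 * wmean(w, x)))
r_formula = num / (math.sqrt(wcov(w, c, c)) * math.sqrt(var))

# Statement 4: weighted variance after multiplying part II by k.
x_mult = [k * xi if i in part2 else xi for i, xi in enumerate(x)]
var_direct = wcov(w, x_mult, x_mult)
s1_wx = sum(w[i] * x[i] for i in part1)
s2_wx2 = sum(w[i] * x[i] ** 2 for i in part2)
var_formula = (wcov(w, x, x) + (k ** 2 - 1) * (s2_wx2 - s2_wx ** 2)
               - 2 * (k - 1) * s1_wx * s2_wx)
```

Global shifts and rescalings leave $r_w$ unchanged, whereas brightening only one part of the candidate area changes it; in this toy example the coefficient drops from about 0.93 to about 0.87.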


3 Experiments 
3.1 Data 


Three datasets are considered in the experiments. In the first dataset, the task is to 
localize the mouth in a database containing 212 grey-scale 2D facial images 
of healthy individuals of size 192 × 256 pixels. The database, previously analyzed 
in [6], was acquired at the Institute of Human Genetics, University of Duisburg-Essen, 
within research of genetic syndrome diagnostics based on facial images [1] 
under the projects BO 1955/2-1 and WU 314/2-1 of the German Research Council 
(DFG). We consider the training dataset to consist of the first 124 images, while the 
remaining 88 images represent an independent test set acquired later but still under 
the same standardized conditions fulfilling assumptions of unbiased evaluation. The 
centroid described below is used with I = 26 and J = 56. 

Using always raw training images, the methods are applied not only to the raw test 
set, but also to the test set after being artificially modified using models inspired by 
Section 2.1. On the whole, six different versions of the test database are considered; 
the modifications required that we first manually localized the mouths in the test 
images: 

1. Raw images. 

2. Illumination. If we consider a pixel [i, j] with intensity $f_{ij}$ in an image (say) f, 
then the modified grey-scale intensity $f^*_{ij}$ will be 

$$f^*_{ij} = f_{ij} + \lambda |j - j_0|, \quad i = 1, \dots, N_R, \quad j = 1, \dots, N_C, \tag{9}$$

where $[i_0, j_0]$ are the coordinates of the mouth and $\lambda = 0.002$. 

3. A more severe version of the modification (ii) with $\lambda = 0.004$. 

4. Asymmetry. In every test image, each true mouth x of size 26 × 56 pixels with 
intensities $x_{ij}$ is replaced by 

$$\begin{cases} x_{ij} + 0.2, & i = 1, \dots, 26, \; j = 1, \dots, 15, \\ x_{ij}, & i = 1, \dots, 26, \; j = 16, \dots, 41, \\ x_{ij} + 0.1, & i = 1, \dots, 26, \; j = 42, \dots, 56. \end{cases} \tag{10}$$

5. Rotation. The candidate area classified as the mouth in the given image is the 
one that maximizes (1) or (3) over the three versions of the image, namely 
after rotations by +5, 0, and -5 degrees. 

6. Image denoising (for raw images). The LWS-filter [5], replacing each grey 
value by the least weighted squares estimate [7] computed from a circular 
neighborhood with radius 4 pixels, was applied to each test image. 


The optimized centroids were explained in [6] to be applicable also to classification 
tasks for other data than images, if they follow a matched pairs design. We 
use two datasets from [6] in the experiments and their classification accuracies are 
reported in a 10-fold cross validation. 


AMI. The gene expressions of 4000 genes over 92 individuals in two versions (raw 
or contaminated by outliers). The aim is to learn a classification rule that assigns 
a new individual to one of the two given groups (controls or patients with 
acute myocardial infarction (AMI)). 

Simulated data. The design mimics a 1:1 matched case-control study with 2500 
variables over 60 individuals in two versions (raw or contaminated by outliers) 
and the aim is again to classify between two given groups (patients and controls). 


Fig. 2 The average centroid used as the initial choice for the centroid optimization. 
3.2 Methods 


The following methods are compared in the experiments; standard methods are 
computed using R software and we use our own C++ implementation of centroid- 
based methods. The average centroid is obtained as the average of all mouths of the 
training set, or the average across all patients. The centroid optimization starts with 
the average centroid as the initial one, and the optimization of weights starts with 
equal weights as the initial ones: 


A.  Centroid-based method (2). 

B. Centroid-based method (1) with average centroid (Figure 2) and equal weights. 

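C. Centroid-based method (1) with average centroid, replacing $r_w$ by cosine similarity 
defined for $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^d$ as 

$$\cos(x, y) = \frac{x^T y}{\|x\|_2 \|y\|_2} = \frac{\sum_{i=1}^d x_i y_i}{\left(\sum_{i=1}^d x_i^2\right)^{1/2} \left(\sum_{i=1}^d y_i^2\right)^{1/2}}. \tag{11}$$
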

D. Centroid-based method (1) with optimal centroid and equal weights [6]. 

E. Centroid-based method (1) with optimal centroid and optimal weights as in 
[6] (optimizing the centroid and only after that the weights), i.e. with posterior 
variable selection (pixel selection). 

F. Centroid-based method (1) as in [6], where however the weights are optimized 
first, and then the centroid is optimized. 

G. Centroid-based method (1) as in [6], where however each step of centroid 
optimization is immediately followed by optimization of the weights; this method 
performs (in contrast to [6]) intrinsic variable selection. 

H. Centroid-based method (1) as in [6], where however each optimization step 
proceeds over 10 worst images (instead of the very worst image). 

I. Centroid-based method (1) with average centroid, where $r_w$ is used as $r_{LWS}$ [7] 
with weight function 


2 
(i =ef- ZPNELI (12) 


corresponding to a (trimmed) density of the Gaussian N(0, 1) distribution; 1 denotes 
an indicator function. To explain, the computation of $r_{LWS}(x, y)$ starts by 
fitting the LWS estimator in the linear regression with y as the response and x as 
the regressor, and $r_w$ is used with the weights determined by the LWS estimator. 

J. The method (I) with the weight function $w_2(t) = 1[t < 3/4]$ for $t \in [0, 1]$. 

K. The approach of [12], which is however meaningful only for the mouth localization 
dataset. 
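
The difference between the correlation coefficient and the cosine similarity of method C under a change of brightness can be illustrated on toy vectors. This is only a sketch with hypothetical names and values; it uses the fact that the Pearson correlation is the cosine similarity of the centred vectors.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between the vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation as the cosine similarity of centred vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

centroid = [0.2, 0.5, 0.3, 0.9]
area = [0.1, 0.4, 0.3, 0.8]
brighter = [v + 0.5 for v in area]   # the same area, uniformly brighter

r_change = abs(pearson(area, centroid) - pearson(brighter, centroid))
cos_change = abs(cosine_similarity(area, centroid)
                 - cosine_similarity(brighter, centroid))
```

Adding a constant to all pixels leaves r unchanged up to rounding, while the cosine similarity changes, because it does not centre the vectors.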
Table 1 Classification accuracy for three datasets. For the mouth localization data, modifications of 
the test images are described in Section 3.1: (i) None (raw images); (ii) Illumination; (iii) More severe 
illumination; (iv) Asymmetry; (v) Rotation; (vi) Image denoising. A detailed description of the methods 
is given in Section 3.2. 

          Mouth localization                     AMI           Simulated 
Method    (i)   (ii)  (iii) (iv)  (v)   (vi)     Raw   Cont.   Raw   Cont. 
A         0.90  0.86  0.81  0.88  0.81  0.93     0.73  0.66    0.71  0.67 
B         0.93  0.90  0.86  0.92  0.86  0.95     0.76  0.70    0.77  0.70 
C         0.89  0.84  0.74  0.89  0.84  0.93     0.72  0.61    0.70  0.64 
D         1.00  0.98  0.95  0.99  0.93  0.98     0.85  0.83    0.80  0.77 
E         1.00  1.00  0.98  1.00  0.95  0.98     0.87  0.85    0.83  0.80 
F         1.00  0.98  0.96  1.00  0.89  0.97     0.86  0.82    0.79  0.73 
G         1.00  0.96  0.95  1.00  0.93  0.99     0.88  0.85    0.86  0.82 
H         1.00  1.00  0.98  1.00  0.92  0.96     0.86  0.83    0.84  0.79 
I         0.96  0.96  0.93  0.99  0.94  0.96     0.77  0.72    0.75  0.71 
J         0.94  0.93  0.89  0.95  0.89  0.93     0.74  0.69    0.72  0.66 
K         1.00  1.00  0.97  0.95  0.97  0.96     Not meaningful 


3.3 Results 


The results as ratios of correctly classified cases are presented in Table 1. For the 
mouth localization, the optimized centroids of methods D, F, and H turn out to outperform 
simple centroids (A, B, and C); the novel modifications E and G performing 
intrinsic variable selection yield the best results. Simple standard centroids (A, B, 
and C) are non-robust to data contamination; this follows from Section 2.1 and from 
analogous considerations for other types of contaminating the images. On the other 
hand, the robustness of optimized centroids is achieved by their optimization (but 
not by using $r_w$ as such). Methods E and G are even able to outperform methods I 
and J based on $r_{LWS}$. We recall that $r_{LWS}$ is globally robust in terms of the breakdown 
point [4], is computationally very demanding, and does not seem to allow 
any feasible optimization. Other results reported previously in [6] revealed that 
numerous standard machine learning methods are also too vulnerable (non-robust) with 
respect to data contamination, if measuring the similarity by r or $r_w$. 

For the AMI dataset, methods E and G with variable selection yield the best 
results for raw as well as contaminated datasets. For the simulated data, the method G 
yields the best results and the method E stays only slightly behind as the second best 
method. 
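
A compact way to read the mouth localization columns of Table 1 is to compute, for each method, the worst accuracy over the six test versions and the loss relative to the raw images. The values below are transcribed from Table 1, reading garbled entries such as 098 as 0.98; the summary itself is only an illustrative aid and not part of the reported analysis.

```python
# Accuracies for the test versions (i)-(vi) of the mouth localization data.
mouth = {
    "A": [0.90, 0.86, 0.81, 0.88, 0.81, 0.93],
    "B": [0.93, 0.90, 0.86, 0.92, 0.86, 0.95],
    "C": [0.89, 0.84, 0.74, 0.89, 0.84, 0.93],
    "D": [1.00, 0.98, 0.95, 0.99, 0.93, 0.98],
    "E": [1.00, 1.00, 0.98, 1.00, 0.95, 0.98],
    "F": [1.00, 0.98, 0.96, 1.00, 0.89, 0.97],
    "G": [1.00, 0.96, 0.95, 1.00, 0.93, 0.99],
    "H": [1.00, 1.00, 0.98, 1.00, 0.92, 0.96],
    "I": [0.96, 0.96, 0.93, 0.99, 0.94, 0.96],
    "J": [0.94, 0.93, 0.89, 0.95, 0.89, 0.93],
    "K": [1.00, 1.00, 0.97, 0.95, 0.97, 0.96],
}

# Worst accuracy over all six versions of the test set.
worst = {m: min(acc) for m, acc in mouth.items()}
# Largest loss of accuracy compared with the raw images, version (i).
drop = {m: round(acc[0] - min(acc), 2) for m, acc in mouth.items()}

for m in sorted(mouth):
    print(f"{m}: worst = {worst[m]:.2f}, largest drop = {drop[m]:.2f}")
```

For the simple centroids A to C the worst case lies between 0.74 and 0.86, whereas none of the optimized centroids D to H falls below 0.89.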


4 Conclusions 


Understanding the robustness of centroids represents a crucial question in image 
processing with applications for convolutional neural networks (CNNs), because 
centroids are very versatile tools that may be based on deep features learned by deep 
learning. We focus on small datasets, for which CNNs cannot be used [10]. This 
paper is interested in performance of centroid-based object localization over small 
databases with non-standardized images, which commonly appear e.g. in medical 
image analysis. 

The requirements on robustness with respect to modifications of the images turn 
out not to contradict the requirements on optimality of the centroids. The method G 
applying an intrinsic variable selection on the optimal centroid and weights [6] 
can be interpreted within a broader framework of robust dimensionality reduction 
(see [8] for an overview) or low-energy approximate computation. Additional results
not presented here show that the method based on optimized centroids is also robust
to small shifts. Neither the theoretical part of this paper nor the experiments exploit
any specific properties of faces. The presented robust method has potential also for 
various other applications, e.g. for deep fake detection by centroids, robust template 
matching by CNNs [9], or applying filters in convolutional layers of CNNs. 
Data Clustering and Representation Learning 
Based on Networked Data 


Lazhar Labiod and Mohamed Nadif 


Abstract To deal simultaneously with attributed network embedding and
clustering, we propose a new model exploiting both content and structure
information. The proposed model relies on the approximation of the relaxed continuous
embedding solution by the true discrete clustering. Thereby, we show that incorporating
an embedding representation provides simpler and more easily interpretable solutions.
Experimental results demonstrate that the proposed algorithm performs better, in terms
of clustering, than state-of-the-art algorithms, including deep learning methods
devoted to similar tasks.


Keywords: networked data, clustering, representation learning, spectral rotation 


1 Introduction 


In recent years, Networks [4] and Attributed Networks (AN) [8] have been used to
model a large variety of real-world networks, such as academic and health care
networks where both node links and attributes/features are available for analysis.
Unlike plain networks in which only node links and dependencies are observed,
with AN, each node is associated with a valuable set of features. In other words, we
have the attributes X, and the structure W is obtained/available independently of X.
More recently, representation learning has received a significant amount of attention
as an important aim in many applications including social networks, academic citation
networks and protein-protein interaction networks. Hence, Attributed Network Embedding
(ANE) [2] aims to seek a continuous low-dimensional matrix representation for nodes
in a network, such that the original network topological structure and node attribute
proximity can be preserved in the new low-dimensional embedding.

Although many approaches have emerged for Network Embedding (NE), research on
ANE (Attributed Network Embedding) still remains to be explored
[3]. Unlike NE, which learns from plain networks, ANE aims to capitalize on both the
proximity information of the network and the affinity of node attributes. Note that,
due to the heterogeneity of the two information sources, it is difficult for the existing
NE algorithms to be directly applied to ANE. To sum up, the learned representation
has been shown to be helpful in many learning tasks such as network clustering [13].
Therefore, ANE is a challenging research problem due to the high dimensionality,
sparsity and non-linearity of the graph data.

The paper is organized as follows. In Section 2 we formulate the objective function 
to be optimized, describe the different matrices used, and present a Simultaneous 
Attributed Network Embedding and Clustering (SANEC) framework for embedding 
and clustering. Section 3 is devoted to numerical experiments. Finally, the conclusion 
summarizes the advantages of our contribution. 


2 Proposed Method 


In this section, we describe the SANEC method. We present the formulation of
an objective function and an effective algorithm for data embedding and clustering.
But first, we show how to construct two matrices S and M integrating both types of
information (content and structure) to reach our goal.


2.1 Content and Structure Information 


An attributed network $G = (V, E, X)$ consists of $V$, the set of nodes, $E \subseteq V \times V$,
the set of links, and $X = [x_1, x_2, \ldots, x_n]$, where $n = |V|$ and $x_i \in \mathbb{R}^d$ is the
feature/attribute vector of the node $v_i$. Formally, the graph can be represented by two
types of information, the content information $X \in \mathbb{R}^{n \times d}$ and the structure information
$A \in \mathbb{R}^{n \times n}$, where $A$ is an adjacency matrix of $G$ and $a_{ij} = 1$ if $e_{ij} \in E$, otherwise 0;
we consider that each node is a neighbor of itself, so we set $a_{ii} = 1$ for all nodes.
Thereby, we model the node proximity by an $(n \times n)$ transition matrix $W$ given by
$W = D^{-1}A$, where $D$ is the degree matrix of $A$ defined by $d_{ii} = \sum_{i'=1}^{n} a_{i'i}$.

In order to exploit additional information about node similarity from $X$, we
preprocess the dataset $X$ to produce a similarity graph $W_X$ of size
$(n \times n)$; we construct a K-Nearest-Neighbor (KNN) graph. To this end, we use the
heat kernel and the $L_2$ distance, the KNN neighborhood mode with $K = 15$, and we set the
width of the neighborhood $\sigma = 1$. Note that any appropriate distance or dissimilarity
measure can be used. Finally, we combine in an $(n \times n)$ matrix $S$ the node proximity
from both the content information $X$ and the structure information $W$. In this way, we intend
to perturb the similarity $W$ by adding the similarity from $W_X$; we choose to take $S$
defined by $S = W + W_X$ (Figure 1).
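
The construction of $W$, $W_X$ and $S$ described above can be sketched as follows. This is only a minimal illustration: the exact heat-kernel form $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, the symmetrisation of the KNN graph and the function names are assumptions that the text does not specify.

```python
import numpy as np

def transition_matrix(A):
    """Transition matrix W = D^{-1} A after adding self-loops (a_ii = 1)."""
    A = np.array(A, dtype=float)
    np.fill_diagonal(A, 1.0)             # each node is a neighbour of itself
    deg = A.sum(axis=1)                  # degrees (row sums; equal to column sums for symmetric A)
    return A / deg[:, None]

def knn_heat_kernel(X, n_neighbors=15, sigma=1.0):
    """KNN graph W_X with heat-kernel weights (kernel form is an assumption)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared L2 distances
    K = np.exp(-sq / (2.0 * sigma ** 2))
    W_X = np.zeros((n, n))
    k = min(n_neighbors, n - 1)
    for i in range(n):
        order = np.argsort(sq[i])
        nbrs = [j for j in order if j != i][:k]   # k nearest neighbours, excluding i itself
        W_X[i, nbrs] = K[i, nbrs]
    return np.maximum(W_X, W_X.T)        # assumed symmetrisation of the KNN graph

# Toy example: a path graph on 4 nodes with 2-dimensional attributes.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W = transition_matrix(A)
W_X = knn_heat_kernel(X, n_neighbors=1)
S = W + W_X                              # combined proximity matrix
```

With the paper's setting one would call `knn_heat_kernel(X, n_neighbors=15, sigma=1.0)`; the toy example uses a single neighbour only because it has four nodes.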

Fig. 1 Model and objective function of SANEC.

As we aim to perform clustering, we propose to integrate it in the formulation of
a new data representation by assuming that nodes with the same label tend to have
similar social relations and similar node attributes. This idea is inspired by the fact
that the labels are strongly influenced by both content and structure information
and inherently correlated to both these information sources. Thereby, the new data
representation, referred to as $M = (m_{ij})$ of size $(n \times d)$, can be considered as a
multiplicative integration of both $W$ and $X$, obtained by replacing each node by the centroid
of its neighborhood (barycenter): i.e., $m_{ij} = \sum_{k} w_{ik} x_{kj}$, $\forall i, j$, or $M = WX$. In
this way, given a graph $G$, graph clustering aims to partition the nodes in $G$ into
$k$ disjoint clusters $\{C_1, C_2, \ldots, C_k\}$, so that: (1) nodes within the same cluster are
close to each other while nodes in different clusters are distant in terms of graph
structure; and (2) the nodes within the same cluster are more likely to have similar
attribute values.
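
As a small worked illustration of the barycentric representation $M = WX$, the following sketch replaces each node's attribute vector by the average over its neighborhood (including itself). The toy graph and attribute values are invented for the example.

```python
import numpy as np

# Toy path graph on 3 nodes with self-loops, and 2 attributes per node (illustrative values).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

W = A / A.sum(axis=1, keepdims=True)   # W = D^{-1} A
M = W @ X                              # m_ij = sum_k w_ik x_kj
print(M)
```

Row 2 of `M`, for instance, is the mean of the attribute vectors of all three nodes, since the middle node is linked to both ends.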


2.2 Model, Optimization and Algorithm 


Let $k$ be the number of clusters and the number of components into which the data
is embedded. With $M$ and $S$, the SANEC method that we propose aims to obtain
the maximally informative embedding according to the clustering structure in the
attributed network data. Therefore, we propose to optimize

$$\min_{B,Z,Q,G} \|M - BQ^\top\|^2 + \lambda \|S - GZB^\top\|^2 \quad \text{s.t. } B^\top B = I,\ Z^\top Z = I,\ G \in \{0,1\}^{n \times k} \tag{1}$$

where $G = (g_{ij})$ of size $(n \times k)$ is a cluster membership matrix, $B = (b_{ij})$ of size
$(n \times k)$ is the embedding matrix and $Z = (z_{ij})$ of size $(k \times k)$ is an orthonormal
rotation matrix which most closely maps $B$ to $G \in \{0,1\}^{n \times k}$. $Q \in \mathbb{R}^{d \times k}$ is the
feature embedding matrix. Finally, the parameter $\lambda$ is a non-negative value and
can be viewed as a regularization parameter. The intuition behind the factorization
of $M$ and $S$ is to encourage the nodes with similar proximity, those with higher
similarity in both matrices, to have closer representations in the latent space given
by $B$. In doing so, the optimisation of (1) leads to a clustering of the nodes into $k$
clusters given by $G$. Note that both tasks (embedding and clustering) are performed
simultaneously and supported by $Z$; it is the key to attaining a good embedding while
taking into account the clustering structure. To infer the latent factor matrices $Z$, $B$,
$Q$ and $G$, we shall derive an alternating optimization algorithm. To this end, we rely
on the following proposition.


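Proposition 1. Let $S \in \mathbb{R}^{n \times n}$, $G \in \{0,1\}^{n \times k}$, $Z \in \mathbb{R}^{k \times k}$ and $B \in \mathbb{R}^{n \times k}$ with $B^\top B = I$. We have

$$\|S - GZB^\top\|^2 = \|S - SBB^\top\|^2 + \|SB - GZ\|^2 \tag{2}$$

Proof. We first expand the matrix norm of the left term of (2):

$$\|S - GZB^\top\|^2 = \|S\|^2 + \|GZB^\top\|^2 - 2\,\mathrm{Tr}(SGZB^\top) \tag{3}$$

In a similar way, we obtain from the two terms of the right term of (2)

$$\|S - SBB^\top\|^2 = \|S\|^2 - \|SB\|^2 \quad \text{due to } B^\top B = I \tag{4}$$

and $\|SB - GZ\|^2 = \|SB\|^2 + \|GZ\|^2 - 2\,\mathrm{Tr}(SBZ^\top G^\top)$.
Due also to $B^\top B = I$, we have

$$\|SB - GZ\|^2 = \|SB\|^2 + \|GZB^\top\|^2 - 2\,\mathrm{Tr}(SGZB^\top) \tag{5}$$

Summing the two terms of (4) and (5) leads to the left term of (2):

$$\|S\|^2 + \|GZ\|^2 - 2\,\mathrm{Tr}(SGZB^\top) = \|S - GZB^\top\|^2 \quad \text{due to } \|GZ\|^2 = \|GZB^\top\|^2.$$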
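
Identity (2) is easy to check numerically. The sketch below draws random matrices of the stated shapes (with $B$ and $Z$ orthonormalised by a QR factorisation) and compares both sides; the sizes and random seed are arbitrary choices for illustration. The matrix $S$ is symmetrised here so that the trace expressions used in the proof apply as written.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

S = rng.random((n, n))
S = (S + S.T) / 2                                   # symmetric S, for illustration
B, _ = np.linalg.qr(rng.standard_normal((n, k)))    # B with orthonormal columns (B^T B = I)
Z, _ = np.linalg.qr(rng.standard_normal((k, k)))    # orthonormal rotation (Z^T Z = I)
G = np.eye(k)[rng.integers(0, k, size=n)]           # random hard cluster memberships

# np.linalg.norm of a matrix is the Frobenius norm by default.
lhs = np.linalg.norm(S - G @ Z @ B.T) ** 2
rhs = np.linalg.norm(S - S @ B @ B.T) ** 2 + np.linalg.norm(S @ B - G @ Z) ** 2
print(lhs, rhs)
```

The two printed values agree up to floating-point error.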


Compute Z. Fixing $G$ and $B$, the problem which arises in (1) is equivalent to
$\min_Z \|S - GZB^\top\|^2$. From Proposition 1, we deduce that

$$\min_Z \|S - GZB^\top\|^2 \Leftrightarrow \min_Z \|S - SBB^\top\|^2 + \|SB - GZ\|^2 \tag{6}$$

which can be reduced to $\max_Z \mathrm{Tr}(G^\top SBZ^\top)$ s.t. $Z^\top Z = I$. As proved on page 29
of [1], let $U\Sigma V^\top$ be the SVD of $G^\top SB$; then $Z = UV^\top$.
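
The rotation update is an orthogonal Procrustes problem. A minimal sketch follows; the function name `update_Z` and the random test data are illustrative and not taken from the paper. The loop at the end compares the Procrustes solution against randomly drawn orthonormal matrices.

```python
import numpy as np

def update_Z(G, S, B):
    """Z = U V^T, where U Sigma V^T is the SVD of the (k x k) matrix G^T S B."""
    U, _, Vt = np.linalg.svd(G.T @ S @ B)
    return U @ Vt

rng = np.random.default_rng(1)
n, k = 10, 3
S = rng.random((n, n))
B, _ = np.linalg.qr(rng.standard_normal((n, k)))
G = np.eye(k)[rng.integers(0, k, size=n)]

Z = update_Z(G, S, B)
best = np.linalg.norm(S @ B - G @ Z)

# No randomly drawn orthonormal Z should achieve a smaller residual ||SB - GZ||.
others = []
for _ in range(20):
    Zr, _ = np.linalg.qr(rng.standard_normal((k, k)))
    others.append(np.linalg.norm(S @ B - G @ Zr))
print(best, min(others))
```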


Compute Q. Given $G$, $Z$ and $B$, the optimization problem (1) is equivalent to
$\min_Q \|M - BQ^\top\|^2$, and we get

$$Q = M^\top B. \tag{7}$$

Thereby, $Q$ can be seen as an embedding of the attributes.

Compute B. Given $G$, $Q$ and $Z$, the problem (1) is equivalent to

$$\max_B \mathrm{Tr}\big((MQ + \lambda SGZ)B^\top\big) \quad \text{s.t. } B^\top B = I.$$

In the same manner as for the computation of $Z$, let $\hat{U}\hat{\Sigma}\hat{V}^\top$ be the SVD of
$(MQ + \lambda SGZ)$; we get

$$B = \hat{U}\hat{V}^\top. \tag{8}$$


It is important to emphasize that, at each step, B exploits the information from the 
matrices Q, G, and Z. This highlights one of the aspects of the simultaneity of 
embedding and clustering. 

Compute G. Finally, given $B$, $Q$ and $Z$, the problem (1) is equivalent to
$\min_G \|SB - GZ\|^2$. As $G$ is a cluster membership matrix, its computation is done as
follows: we fix $Q$, $Z$, $B$, let $\tilde{B} = SB$, and calculate

$$g_{ik} = 1 \text{ if } k = \arg\min_{k'} \|\tilde{b}_i - z_{k'}\|^2 \text{ and } 0 \text{ otherwise}, \tag{9}$$

where $\tilde{b}_i$ and $z_{k'}$ denote the $i$-th row of $\tilde{B}$ and the $k'$-th row of $Z$.

In summary, the steps of the SANEC algorithm relying on $S$, referred to as SANEC$_S$,
are given in Algorithm 1. The convergence of SANEC$_S$ is guaranteed, but only to a
local optimum that depends on the initialization. Hence, we start the
algorithm several times and select the best result, i.e. the one that minimizes the objective
function (1).


Algorithm 1 : SANEC$_S$ algorithm


Input: M and S from structure matrix W and content matrix X;
Initialize: B, Q and Z with arbitrary orthonormal matrices;
repeat
(a) - Compute G using (9)
(b) - Compute B using (8)
(c) - Compute Q using (7)
(d) - Compute Z using (6)
until convergence
Output: G: clustering matrix, Z: rotation matrix, B: nodes embedding and Q: attributes embedding
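
The alternating scheme of Algorithm 1 can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' implementation: the function name `sanec`, the fixed number of iterations, the random initialisation and the toy data are assumptions. The B-update uses `S.T @ G @ Z`, which coincides with the $SGZ$ of the text when $S$ is symmetric.

```python
import numpy as np

def sanec(M, S, k, lam=1.0, n_iter=30, seed=0):
    """Alternate updates (9), (8), (7), (6) and record the value of objective (1)."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    B, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
    Z, _ = np.linalg.qr(rng.standard_normal((k, k)))   # orthonormal rotation
    Q = M.T @ B                                        # (7)
    losses = []
    for _ in range(n_iter):
        # (9): assign node i to the row of Z closest to row i of SB
        SB = S @ B
        dist = ((SB[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        G = np.eye(k)[dist.argmin(axis=1)]
        # (8): B = U V^T from the thin SVD of MQ + lam * S^T G Z
        U, _, Vt = np.linalg.svd(M @ Q + lam * S.T @ G @ Z, full_matrices=False)
        B = U @ Vt
        # (7): attribute embedding
        Q = M.T @ B
        # (6): orthogonal Procrustes on G^T S B
        U, _, Vt = np.linalg.svd(G.T @ S @ B)
        Z = U @ Vt
        losses.append(np.linalg.norm(M - B @ Q.T) ** 2
                      + lam * np.linalg.norm(S - G @ Z @ B.T) ** 2)
    return G, B, Q, Z, losses

# Toy attributed network: two groups of three nodes (illustrative data).
A = np.kron(np.eye(2), np.ones((3, 3)))                # block adjacency with self-loops
X = np.vstack([np.tile([1.0, 0.0], (3, 1)), np.tile([0.0, 1.0], (3, 1))])
W = A / A.sum(axis=1, keepdims=True)
M, S = W @ X, W                                        # here S = W for simplicity
G, B, Q, Z, losses = sanec(M, S, k=2)
```

Because each step minimises objective (1) exactly with respect to one block of variables, the recorded losses should never increase; in practice one would restart from several seeds and keep the run with the smallest final loss, as the text recommends.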


3 Numerical Experiments 


In the following, we compare SANEC with some competitive methods described later.
The performances of all clustering methods are evaluated using challenging real-world
datasets commonly tested with ANE where the clusters are known. Specifically,
we consider three public citation network data sets, Citeseer, Cora and Wiki, which
contain a sparse bag-of-words feature vector for each document and a list of citation
links between documents. Each document has a class label. We treat documents as
nodes and the citation links as the edges. The characteristics of the datasets used are
summarized in Table 1. The balance coefficient is defined as the ratio of the number
of documents in the smallest class to the number of documents in the largest class,
while nz denotes the percentage of sparsity.


Table 1 Description of datasets (#: the cardinality).


datasets   # Nodes  # Attributes  # Edges  # Classes  nz(%)  Balance
Cora       2708     1433          5294     7          98.73  0.22
Citeseer   3312     3703          4732     6          99.14  0.35
Wiki       2405     4973          17981    17         86.46  0.02
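
The two summary statistics of Table 1 can be computed as in the sketch below. It assumes that "percentage of sparsity" means the percentage of zero entries of $X$; the function names and the toy inputs are illustrative.

```python
from collections import Counter

def balance_coefficient(labels):
    """Size of the smallest class divided by the size of the largest class."""
    counts = Counter(labels).values()
    return min(counts) / max(counts)

def sparsity_percentage(X):
    """Percentage of zero entries in a dense matrix given as a list of rows (assumed meaning of nz)."""
    total = sum(len(row) for row in X)
    zeros = sum(1 for row in X for v in row if v == 0)
    return 100.0 * zeros / total

labels = ["a", "a", "a", "a", "b", "b", "c"]
X = [[0, 1, 0, 0], [0, 0, 0, 2], [1, 0, 0, 0]]
print(balance_coefficient(labels))   # 0.25
print(sparsity_percentage(X))        # 75.0
```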


In our comparison we include standard methods and also recent deep learning
methods; these differ in the way they use available information. Some of them (such
as K-means) use only X as the baseline, while others are more recent algorithms
based on X and W. The compared methods include TADW [14], DeepWalk [7] and
Spectral Clustering [11]. Using X and W, we evaluated GAE and VGAE [5],
ARVGA [6], AGC [15] and DAEGC [12].

With the SANEC model, the parameter $\lambda$ controls the role of the second term
$\|S - GZB^\top\|^2$ in (1). To measure its impact on the clustering performance of SANEC$_S$,
we vary $\lambda$ in {0, 1075, 102, 1071, 10°, 10!, 10°}. Through many experiments, as
illustrated in Figure 2, we choose to take $\lambda$ = 107°. The choice of $\lambda$ warrants in-depth
evaluation.


Fig. 2 Sensitivity analysis of $\lambda$ using ACC, NMI and ARI (panels: Cora, Citeseer, Wiki).


Compared to the true available clusters, in our experiments the clustering
performance is assessed by accuracy (ACC), normalized mutual information (NMI)
and adjusted Rand index (ARI). We repeat the experiments 50 times with different
random initializations, and the averages (means) are reported in Table 2; the best
performance for each dataset is highlighted in bold.
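
Clustering accuracy requires matching predicted cluster ids to class labels before counting agreements. The sketch below does this by brute force over all relabelings, which is only practical for a handful of clusters; for larger numbers of classes an assignment solver such as the Hungarian algorithm is the usual choice. The function name and the example labels are illustrative assumptions.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy (%) over all mappings of predicted clusters to classes."""
    true_ids = sorted(set(y_true))
    pred_ids = sorted(set(y_pred))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best = max(best, hits)
    return 100.0 * best / len(y_true)

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [1, 1, 0, 0, 0, 2]
acc = clustering_accuracy(y_true, y_pred)
print(round(acc, 2))   # 83.33
```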


First, we observe the high performances of methods integrating information from
$W$. For instance, RTM and RMSC are better than classical methods using only either $X$
or $W$. On the other hand, all methods including deep learning algorithms relying on
$X$ and $W$ are better yet. However, regarding SANEC with both versions, the one relying
on $W$, referred to as SANEC$_W$, and the one relying on $S$, referred to as SANEC$_S$, we
note high performances for all the datasets; with SANEC$_S$, we remark the impact of
$W_X$: it learns low-dimensional representations that suit the clustering structure.


To go further in our investigation, and given the sparsity of $X$, we applied
tf-idf standardization followed by $L_2$ normalization, as is often done to process
document-term matrices (see e.g. [9, 10]), while in the construction of $W_X$ we used
the cosine metric. The results are reported in Figure 3, where we observe a slight improvement.
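
A possible form of this preprocessing is sketched below. The smoothed idf formula is one common variant and is an assumption, since the text does not state which tf-idf definition was used; the function names are likewise illustrative.

```python
import numpy as np

def tfidf_l2(X):
    """tf-idf weighting followed by row-wise L2 normalisation (assumed variant)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    df = np.count_nonzero(X, axis=0)              # document frequency of each term
    idf = np.log((1.0 + n) / (1.0 + df)) + 1.0    # smoothed idf
    T = X * idf
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                       # keep empty documents as zero rows
    return T / norms

def cosine_similarity(T):
    """Cosine similarity between rows; for L2-normalised rows this is T T^T."""
    return T @ T.T

X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 0, 1]])
T = tfidf_l2(X)
C = cosine_similarity(T)
```

Documents 0 and 1 share no term, so their cosine similarity is zero, while documents 0 and 2 use the same two terms and are highly similar.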


Table 2 Clustering performances (ACC %, NMI % and ARI %).


Methods   Input   Cora                 Citeseer             Wiki
                  ACC   NMI   ARI      ACC   NMI   ARI      ACC   NMI   ARI
K-means   X       49.22 32.10 22.96    54.01 30.54 27.86    41.72 44.02 15.07
Spectral  W       36.72 12.67 03.11    23.89 05.57 01.00    22.04 18.17 01.46
DeepWalk  W       48.40 32.70 24.27    33.65 08.78 09.22    38.46 32.38 17.03
RTM       X, W    43.96 23.01 16.91    45.09 23.93 20.26    43.64 44.95 13.84
RMSC      X, W    40.66 25.51 08.95    29.50 13.87 04.88    39.76 41.50 11.16
TADW      X, W    56.03 44.11 33.20    45.48 29.14 22.81    30.96 27.13 04.54
VGAE      X, W    50.20 32.92 25.47    46.70 26.05 20.56    45.09 46.76 26.34
ARGE      X, W    64.0  44.9  35.2     57.3  35.0  34.1     47.34 47.02 28.16
ARVGE     X, W    63.8  45.0  37.74    54.4  26.1  24.5     46.45 47.8  29.65
SANEC_W   X, W    64.47 43.30 36.19    64.71 38.61 39.20    46.21 42.83 28.30
SANEC_S   X, S    67.38 47.14 39.88    66.77 40.60 41.78    52.80 50.02 35.57


Fig. 3 Evaluation of SANEC$_S$ using tf-idf normalization of $X$ and cosine metric for $W_X$.


4 Conclusion 


In this paper, we proposed a novel matrix decomposition framework for simultaneous
attributed network data embedding and clustering. Unlike known methods that
combine the objective function of AN embedding and the objective function of
clustering separately, we proposed a new single framework, SANEC$_S$, to perform AN
embedding and node clustering. We showed that the optimized objective function
can be decomposed into three terms: the first is the objective function of a kind of
PCA applied to $X$, the second is the graph embedding criterion in a low-dimensional
space, and the third is the clustering criterion. We also integrated a discrete rotation
functionality, which allows a smooth transformation from the relaxed continuous
embedding to a discrete solution, and guarantees a tractable optimization problem
with a discrete solution. Thereby, we developed an effective algorithm capitalizing
on representation learning and clustering. The obtained results show the advantages
of combining both tasks over other approaches. SANEC$_S$ outperforms all recent
methods devoted to the same tasks, including deep learning methods which require
pretraining of deep models. However, there are other points that warrant in-depth evaluation,
such as the choice of $\lambda$ and the complexity of the algorithm in terms of network size.
The proposed framework offers several perspectives and investigations. We have
noted that the construction of $M$ and $S$ is important; it highlights the introduction of
$W$. As for $W_X$, we have observed that it is fundamental, as it makes it possible to link
the information from $X$ to the network; this has been verified by many experiments.
First, we would like to be able to measure the impact of each matrix $W$ and $W_X$ in
the construction of $S$ by considering two different weights for $W$ and $W_X$ as follows:
$S = \alpha W + \beta W_X$. Finally, as we have stressed that $Q$ is an embedding of attributes,
this suggests also considering simultaneous ANE and co-clustering.
Towards a Bi-stochastic Matrix Approximation 
of k-means and Some Variants 


Lazhar Labiod and Mohamed Nadif 


Abstract The k-means algorithm and some k-means variants have been shown to
be useful and effective to tackle the clustering problem. In this paper we embed
k-means variants in a bi-stochastic matrix approximation (BMA) framework. Then
we derive from the k-means objective function a new formulation of the criterion. In
particular, we show that some k-means variants are equivalent to an algebraic problem
of bi-stochastic matrix approximation under some suitable constraints. For optimizing
the derived objective function, we develop two algorithms: the first one consists
in learning a bi-stochastic similarity matrix, while the second seeks the optimal
partition, which is the equilibrium state of a Markov chain process. Numerical
experiments on real data sets demonstrate the interest of our approach.


Keywords: k-means, reduced k-means, factorial k-means, bi-stochastic matrix 


1 Introduction 


In recent decades, unsupervised learning, and specifically clustering, has received
a significant amount of attention as an important problem with many applications in
data science. Let $A = (a_{ij})$ be an $n \times m$ continuous data matrix where the set of rows
(objects, individuals) is denoted by $I$ and the set of columns (attributes, features) by
$J$. Many clustering methods, hierarchical or not, aim to construct an optimal
partition of $I$ or, sometimes, of $J$.

In this paper we show how some k-means variants can be presented as a bi- 
stochastic matrix approximation problem under some suitable constraints generated 
by the properties of the reached solution. To reach this goal, we first demonstrate that 
some variants of k-means are equivalent to learning a bi-stochastic similarity matrix 
having a diagonal block structure. Based on this formulation, referred to as BMA, 
we derive two iterative algorithms, the first algorithm learns a bi-stochastic n x n 
similarity matrix while the second directly seeks an optimal clustering solution. 

Our main contribution is to establish the theoretical connection of the conventional
k-means and some of its variants to the BMA framework. The implications of the
reformulation of k-means as a BMA problem are manifold:
* It makes connections with recent clustering methods like spectral clustering and
subspace clustering.

* It learns a well normalized (bi-stochastic normalization) similarity matrix,
beneficial for spectral clustering [12].

* Unlike existing spectral and subspace methods, which combine the steps of
similarity learning and clustering derivation in a sequential way, our proposed
method jointly learns a block diagonal bi-stochastic affinity matrix which naturally
expresses a clustering structure.


The rest of the paper is organized as follows. Section 2 introduces some variants of
k-means. Section 3 provides Matrix Factorization (MF) and BMA formulations of
k-means variants. Section 4 discusses the BMA clustering algorithm and Section 5
is devoted to numerical experiments. Finally, the conclusion summarizes the interest
of our contribution.


2 Variants of k-Means 


Given a data matrix $A = (a_{ij}) \in \mathbb{R}^{n \times m}$, the aim of clustering is to cluster the rows
or the columns of $A$, so as to optimize the difference between $A = (a_{ij})$ and the
clustered matrix revealing significant block structure. More formally, we seek to
partition the set of rows $I = \{1, \ldots, n\}$ into $k$ clusters $C = \{C_1, \ldots, C_l, \ldots, C_k\}$.
The partitioning naturally induces a clustering index matrix $R = (r_{il}) \in \mathbb{R}^{n \times k}$, defined
as a binary classification matrix such that $r_{il} = 1$ if the row $a_i \in C_l$, and
0 otherwise. On the other hand, we denote by $S \in \mathbb{R}^{m \times k}$ a reduced matrix specifying
the cluster representation. The detection of homogeneous clusters of objects can be
reached by looking for the two matrices $R$ and $S$ minimizing the total squared residue
measure

$$J_{KM}(R, S) = \|A - RS^\top\|^2 \tag{1}$$

The term $RS^\top$ characterizes the information of $A$ that can be described by the cluster
structure. The clustering problem can be formulated as a matrix approximation
problem where the clustering aims to minimize the approximation error between the
original data $A$ and the reconstructed matrix based on the cluster structures.
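
To make criterion (1) concrete, the following sketch builds $R$ from a given partition, takes $S$ as the matrix of cluster centroids and evaluates the squared residue. The helper name `kmeans_residue` and the toy data are illustrative assumptions.

```python
import numpy as np

def kmeans_residue(A, labels, k):
    """Return R, S and J_KM(R, S) = ||A - R S^T||^2 for a given row partition."""
    A = np.asarray(A, dtype=float)
    R = np.eye(k)[labels]              # n x k binary membership matrix
    sizes = R.sum(axis=0)              # cluster sizes (clusters assumed non-empty)
    S = (A.T @ R) / sizes              # m x k matrix whose columns are the centroids
    J = np.linalg.norm(A - R @ S.T) ** 2
    return R, S, J

A = np.array([[1.0, 0.0],
              [3.0, 0.0],
              [0.0, 2.0],
              [0.0, 4.0]])
labels = [0, 0, 1, 1]
R, S, J = kmeans_residue(A, labels, k=2)
print(S.T)   # centroids (2, 0) and (0, 3), one per row
print(J)     # each row lies at distance 1 from its centroid, so J = 4
```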

Factorial k-means analysis (FKM) [9] and Reduced k-means analysis (RKM)
[1] are clustering methods that aim at simultaneously achieving a clustering of the
objects and a dimension reduction of the features. The advantage of these methods
is that both the clustering of objects and a low-dimensional subspace capturing the cluster
structure are simultaneously obtained. To achieve this objective, RKM is defined by
the minimization of the following criterion

$$J_{RKM}(R, S, Q) = \|A - RS^\top Q^\top\|^2 \tag{2}$$

and FKM is defined by the minimization of the following criterion

$$J_{FKM}(R, S, Q) = \|AQ - RS^\top\|^2 \tag{3}$$

where $S \in \mathbb{R}^{p \times k}$ with RKM and FKM, and $Q$ is an $m \times p$ column-wise orthonormal
loading matrix.


3 Bi-stochastic Matrix Approximation of k-Means Variants 
3.1 Low-rank Matrix Factorization (MF) 


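By considering k-means as a lower rank matrix factorization with constraints, rather
than a clustering method, we can formulate constraints to impose on the MF formulation.
Let $D_r^{-1} \in \mathbb{R}^{k \times k}$ be the diagonal matrix defined as follows: $D_r^{-1} = \mathrm{Diag}(r_1^{-1}, \ldots, r_k^{-1})$,
with $r_l$ the number of rows in cluster $C_l$.
Using the matrices $D_r$, $A$ and $R$, the matrix summary $S$ can be expressed as
$S^\top = D_r^{-1} R^\top A$. Plugging $S$ into the objective function in equation (1) leads to optimize
$\|A - R(D_r^{-1} R^\top A)\|^2$, equal to

$$J_{MF-KM}(\tilde{R}) = \|A - \tilde{R}\tilde{R}^\top A\|^2, \quad \text{where } \tilde{R} = R D_r^{-0.5}. \tag{4}$$

On the other hand, it is easy to verify that the approximation $\tilde{R}\tilde{R}^\top A$ of $A$ is formed
by the same value in each block $A_l$ $(l = 1, \ldots, k)$. Specifically, the matrix $\tilde{R}^\top A$, which
equals $S^\top$ up to the row scaling $D_r^{0.5}$, plays the role of a summary of $A$ and absorbs the
different scales of $A$ and $\tilde{R}$. Finally, $\tilde{R}\tilde{R}^\top A$ gives the row cluster mean vectors.
Note that it is easy to show that $\tilde{R}$ verifies the following properties

$$\tilde{R} \ge 0,\ \tilde{R}^\top \tilde{R} = I_k,\ \tilde{R}\tilde{R}^\top \mathbf{1} = \mathbf{1},\ \mathrm{Trace}(\tilde{R}\tilde{R}^\top) = k,\ (\tilde{R}\tilde{R}^\top)^2 = \tilde{R}\tilde{R}^\top \tag{5}$$

Next, in a similar way, we can derive an MF formulation of FKM,

$$J_{MF-FKM}(\tilde{R}) = \|AQ - \tilde{R}\tilde{R}^\top AQ\|^2, \tag{6}$$

and of RKM,

$$J_{MF-RKM}(\tilde{R}) = \|A - \tilde{R}\tilde{R}^\top AQQ^\top\|^2. \tag{7}$$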
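
The properties in (5) can be verified on a small partition. The sketch below builds $\tilde{R} = R D_r^{-0.5}$ for an invented partition of five rows into two clusters and forms the product $\tilde{R}\tilde{R}^\top$; applying it to a data matrix replaces every row by the mean of its cluster.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1])        # toy partition of n = 5 rows into k = 2 clusters
k = 2
R = np.eye(k)[labels]                     # binary membership matrix
sizes = R.sum(axis=0)                     # cluster sizes r_1, ..., r_k
R_tilde = R / np.sqrt(sizes)              # R D_r^{-1/2}
P = R_tilde @ R_tilde.T                   # block diagonal, entries 1/r_l within cluster l

A = np.arange(10, dtype=float).reshape(5, 2)
means = P @ A                             # every row replaced by its cluster mean
print(means)
```

Here the first three rows of `means` equal (2, 3) and the last two equal (7, 8), the two cluster means of `A`.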


3.2 BMA Formulation 


Let $\Pi = \tilde{R}\tilde{R}^\top$ be a bi-stochastic similarity matrix. Before giving the BMA
formulation of k-means variants, we first need to spell out the good properties of $\Pi$. Indeed,
by construction from $\tilde{R}$, $\Pi$ has at least the following properties, which
can be easily proven:

$$\Pi \ge 0,\ \Pi^\top = \Pi,\ \Pi\mathbf{1} = \mathbf{1},\ \mathrm{Trace}(\Pi) = k,\ \Pi\Pi^\top = \Pi,\ \mathrm{Rank}(\Pi) = k. \tag{8}$$


Given a data matrix $A$ and $k$ row clusters, we can hope to discover the cluster structure
of $A$ from $\Pi$. Notice that from (8) $\Pi$ is nonnegative, symmetric, bi-stochastic (doubly
stochastic) and idempotent. By setting k-means in the BMA framework, the
problem of clustering is reformulated as the learning of a structured bi-stochastic
similarity matrix $\Pi$ by minimizing the following k-means variants objectives,

$$J_{BMA-KM}(\Pi) = \|A - \Pi A\|^2, \tag{9}$$
$$J_{BMA-FKM}(\Pi) = \|AQ - \Pi AQ\|^2, \tag{10}$$
$$J_{BMA-RKM}(\Pi) = \|A - \Pi AQQ^\top\|^2, \tag{11}$$

with respect to the following constraints on $\Pi$

$$\Pi \ge 0,\ \Pi = \Pi^\top,\ \Pi\mathbf{1} = \mathbf{1},\ \mathrm{Tr}(\Pi) = k,\ \Pi\Pi^\top = \Pi \tag{12}$$

and $Q^\top Q = I$ for equations (10) and (11).


In the rest of the paper, we will consider only non-negativity, symmetry and bi- 
stochastic constraints. 


3.3 The Equivalence Between BMA and k-Means 


The theorem below demonstrates that the optimization of the k-means objective and
the BMA objective under some suitable constraints are equivalent. Equation
(13) establishes the equivalence between k-means and the BMA formulation. Then,
solving the BMA objective function (9) is equivalent to finding a global solution of
the k-means criterion (1).


Theorem 1

$$\arg\min_{R,S} \|A - RS^\top\|^2 \Leftrightarrow \arg\min_{\{\Pi \ge 0,\ \Pi = \Pi^\top,\ \Pi\mathbf{1} = \mathbf{1},\ \mathrm{Tr}(\Pi) = k,\ \Pi\Pi^\top = \Pi\}} \|A - \Pi A\|^2 \tag{13}$$
The proof of this equivalence is given in the appendix. Note that this new formulation 
gives some interesting highlights on k-means and its variants: 


* First, this shows that k-means is equivalent to learning a structured bi-stochastic
similarity matrix, that is, a normalized bi-stochastic matrix with block diagonal
structure.

* Secondly, it establishes very interesting connections of k-means to many state-of-
the-art subspace clustering methods [10, 5]. Moreover, this formulation combines
the traditional two-step process used by subspace clustering methods, which consists
in first constructing an affinity matrix between data points and then applying
spectral clustering to this affinity. This allows joint learning of a similarity matrix
that better reflects the clustering structure by its block diagonal shape.

* Finally, it allows the spirit of k-means to be applied to graph or similarity data.
4 BMA Clustering Algorithm 


First, we establish the relationship between our objective function and that used in
[12, 11]. From $\|A - \Pi A\|^2 = \mathrm{Trace}(AA^\top) + \mathrm{Trace}(\Pi AA^\top \Pi) - 2\,\mathrm{Trace}(AA^\top \Pi)$
and by using the idempotent property, $\Pi\Pi^\top = \Pi$, we can show that

$$\arg\min_{\Pi} \|A - \Pi A\|^2 \Leftrightarrow \arg\min_{\Pi} \|AA^\top - \Pi\|^2 \Leftrightarrow \arg\max_{\Pi} \mathrm{Trace}(AA^\top \Pi).$$
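
The chain of equivalences rests on two identities that hold for any symmetric idempotent $\Pi$: the residue equals $\mathrm{Trace}(AA^\top) - \mathrm{Trace}(AA^\top\Pi)$, and $\|AA^\top - \Pi\|^2$ differs from $-2\,\mathrm{Trace}(AA^\top\Pi)$ only by terms that are constant once $\mathrm{Trace}(\Pi) = k$ is fixed. A quick numerical check, with random data and an invented partition, could look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))
labels = np.array([0, 0, 1, 1, 2, 2])
R = np.eye(3)[labels]
Pi = (R / R.sum(axis=0)) @ R.T            # R D_r^{-1} R^T: symmetric and idempotent

gram = A @ A.T
residue = np.linalg.norm(A - Pi @ A) ** 2
trace_form = np.trace(gram) - np.trace(gram @ Pi)

gram_distance = np.linalg.norm(gram - Pi) ** 2
constant = np.linalg.norm(gram) ** 2 + np.trace(Pi)   # does not depend on the partition
print(residue, trace_form)
print(gram_distance, constant - 2 * np.trace(gram @ Pi))
```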


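The algorithm for learning the similarity matrix is summarized in Algorithm 1, as in
[12, 11].


Algorithm 1 : Learning similarity matrix
Input: data $A$
Output: similarity matrix $\Pi$
Initialize: $t = 0$ and $\Pi^{(0)} = AA^\top$
repeat
$\Pi^{(t+1)} \leftarrow \left[\Pi^{(t)} + \frac{1}{n}\left(I - \Pi^{(t)} + \frac{\mathbf{1}\mathbf{1}^\top \Pi^{(t)}}{n}\right)\mathbf{1}\mathbf{1}^\top - \frac{1}{n}\mathbf{1}\mathbf{1}^\top \Pi^{(t)}\right]_+$
until Satisfied convergence condition


Once the bi-stochastic similarity matrix $\Pi$ is obtained, the basic idea of
BMA is based on the following steps: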


1. Estimating iteratively $A$ by applying at each time the matrix $\Pi$ on the current
$A$ using the following update: $A^{(t+1)} = \Pi A^{(t)}$. This process converges to an
equilibrium (steady) state. Let $k$ be the multiplicity of the eigenvalue of matrix
$\Pi$ equal to 1; $A$ is then composed of $k \ll n$ quasi-similar rows, where each row is
represented by its prototype.

2. Extracting the first left singular vector $\pi$ of $A$ using the Power method [4],
a well-known technique for computing the leading left singular vector of a
data matrix. The numerical computation of the leading left singular vector of $A$
consists in starting with an arbitrary vector $\pi^{(0)}$ and repeatedly performing updates
of $\pi$ until stabilization, as follows: $\pi^{(t+1)} = AA^\top \pi^{(t)}$ and
$\pi^{(t+1)} \leftarrow \pi^{(t+1)} / \|\pi^{(t+1)}\|$; stop the Power method if
$|\gamma^{(t+1)} - \gamma^{(t)}| \le \epsilon$, where $\gamma^{(t+1)} \leftarrow \|\pi^{(t+1)} - \pi^{(t)}\|$.
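
The similarity-learning step and the smoothing of step 1 can be sketched as below. The update is written under the assumption that Algorithm 1 alternates the closed-form projection onto matrices whose rows and columns sum to one with clipping at zero, as in the bi-stochastic similarity learning work the text cites; the function names, the fixed iteration counts and the test data are illustrative.

```python
import numpy as np

def bistochastic_projection(K, n_iter=200):
    """Alternate an affine projection (row/column sums equal to 1) with clipping at zero."""
    P = np.array(K, dtype=float)
    n = P.shape[0]
    J = np.ones((n, n))                   # the matrix 1 1^T
    I = np.eye(n)
    for _ in range(n_iter):
        P = P + (I - P + (J @ P) / n) @ J / n - (J @ P) / n
        P = np.maximum(P, 0.0)            # [.]_+ keeps the entries non-negative
    return P

def smooth(A, Pi, n_steps=10):
    """Step 1: repeatedly replace A by Pi A."""
    A = np.array(A, dtype=float)
    for _ in range(n_steps):
        A = Pi @ A
    return A

# Sanity check: the all-ones matrix is mapped to the uniform bi-stochastic matrix.
P_ones = bistochastic_projection(np.ones((3, 3)))

rng = np.random.default_rng(0)
A = rng.random((6, 4))
Pi = bistochastic_projection(A @ A.T)     # initialised with A A^T as in Algorithm 1
A_smooth = smooth(A, Pi)
```

A fixed number of iterations is used here for simplicity; a tolerance on the change of the matrix between iterations would be a natural stopping rule.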


Why does this work? At first glance, this process might seem uninteresting since it
eventually leads to a vector in which all rows and columns coincide for any starting vector.
However, our practical experience shows that the vectors $\pi$ first very quickly collapse
into row blocks, and these blocks then move towards each other relatively slowly. If we
stop the Power method iteration at this point, the algorithm has a potential
application for data visualization and clustering. The structure of $\pi$ during short-run
stabilization makes the discovery of the row ordering straightforward. The key is
to look for values of $\pi$ that are approximately equal and to reorder the rows and columns
of the data accordingly. The BMA algorithm involves a reorganization of the rows of the data
matrix $A$ according to the sorted $\pi$. It also allows locating the points corresponding to
an abrupt change in the curve of the first left singular vector $\pi$, and then assessing the
number of clusters and the rows belonging to each cluster.
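
The power iteration and the reordering by sorted values can be illustrated on a small block-structured matrix. The sketch below runs the iteration until the stopping rule is met rather than stopping early, and it cuts the sorted curve at its single largest jump; both are simplifying assumptions, and the function name and data are invented for the example.

```python
import numpy as np

def leading_left_singular_vector(A, eps=1e-10, max_iter=1000, seed=0):
    """Power iteration on A A^T with the stopping rule |gamma_new - gamma_old| <= eps."""
    rng = np.random.default_rng(seed)
    pi = rng.random(A.shape[0])
    pi /= np.linalg.norm(pi)
    gamma_old = np.inf
    for _ in range(max_iter):
        new = A @ (A.T @ pi)
        new /= np.linalg.norm(new)
        gamma = np.linalg.norm(new - pi)
        pi = new
        if abs(gamma - gamma_old) <= eps:
            break
        gamma_old = gamma
    return pi

# Two row blocks: rows 0-1 use the first two columns, rows 2-3 the last two.
A = np.array([[5.0, 5.0, 0.0, 0.0],
              [5.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])
pi = leading_left_singular_vector(A)
order = np.argsort(pi)                    # row ordering induced by the sorted vector
gaps = np.diff(pi[order])
cut = int(np.argmax(gaps)) + 1            # position of the most abrupt change
print(order, cut)
```

Rows 2 and 3 end up together at the start of the ordering, and the largest jump in the sorted values separates them from rows 0 and 1.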
5 Experiments Analysis 


In this section, we first ran our algorithm on two real-world data sets. The first is the
16 townships data, which consists of the characteristics (rows) of 16 townships (columns);
each cell indicates the presence (1) or absence (0) of a characteristic in a township. This
example has been used by Niermann [7] for a data ordering task, where the author aims to
reveal a block diagonal form. The second data set, called Mero data, comes from
archaeological data on Merovingian buckles found in north-eastern France. This data matrix
consists of 59 buckles characterized by 26 attributes of description (see Marcotorchino [6]
for more details). Figure 1 shows, in order, $A$, the equilibrium-state $A$, and $S_R = AA^\top$
reorganized according to the sorted $\pi$, together with the sorted $\pi$ plot, for both data sets.


Fig. 1 left: 16 Townships data - right: Mero data.


We also evaluated the performances of BMA on some challenging real datasets described in Table 1.
We compared the performance of BMA with spectral co-clustering (SpecCo)
[2], Non-negative Matrix Factorization (NMF) and Orthogonal Non-negative Matrix
Tri-Factorization (ONMTF) [3] by using two evaluation metrics: accuracy (ACC),
corresponding to the percentage of well-classified elements, and the normalized mutual
information (NMI) [8]. In Table 1, we observe that BMA outperforms all compared
algorithms for all tested datasets.


Table 1 Clustering Accuracy and Normalized Mutual Information (%).


datasets   # samples  # features  # classes  metric  k-means  NMF    ONMTF  SpecCo  BMA
Classic3   3891       4303        3          ACC     88.6     73.33  70.10  97.89   98.30
                                             NMI     74.9     51.46  51.46  91.17   91.91
CSTR       476        1000        4          ACC     76.3     75.30  77.41  80.21   90.73
                                             NMI     65.4     66.40  67.30  66.36   77.86
Webkb4     4199       1000        4          ACC     60.10    66.30  67.10  61.68   68.8
                                             NMI     45.7     42.70  45.36  48.64   49
Leukemia   38         5000        3          ACC     72.2     89.21  90.32  94.73   97.36
                                             NMI     19.4     75.42  80.50  82      90.69
6 Conclusion 


In this paper we have presented a new reformulation of some variants of k-means
as a unified BMA framework and established the equivalence between k-means and
BMA under suitable constraints. By doing so, k-means leads to learning a structured
bi-stochastic matrix which is beneficial for the clustering task. The proposed approach
not only learns a similarity matrix from the data matrix, but also uses this matrix in an
iterative process that converges to a matrix $A$ in which each row is represented by its
prototype. The clustering solution is given by the first left singular vector of $A$,
without requiring prior knowledge of the number of clusters. For future work, we expect to
integrate the idempotent and trace constraints on $\Pi$ so that the approximate
similarity matrix best fits the case of a block diagonal structure.


Appendix 


From the BMA formulation, we know that one can easily construct a feasible
solution for k-means from a feasible solution of the BMA formulation. Therefore, it remains
to show that from a global solution of the BMA formulation, we can obtain a feasible
solution of k-means. In order to show the equivalence between the optimization of the
k-means formulation and the BMA formulation, we first consider the following lemma.


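Lemma. If $\Pi$ is a symmetric and positive semi-definite matrix, then we have 

(a) $\pi_{ii'} \le \sqrt{\pi_{ii}\,\pi_{i'i'}}$ (geometric mean) $\forall i, i'$ 
(b) $\pi_{ii'} \le \frac{1}{2}(\pi_{ii} + \pi_{i'i'})$ (arithmetic mean) $\forall i, i'$ 
(c) $\max_{i,i'} \pi_{ii'} = \max_{i} \pi_{ii}$ 
(d) $\pi_{ii} = 0 \Rightarrow \pi_{ii'} = \pi_{i'i} = 0$ $\forall i, i'$ 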


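Proposition. Any positive semi-definite matrix $\Pi$ satisfying the constraints: 

$\pi_{ii'} = \pi_{i'i}$ $\forall i, i'$ (symmetry) 
$\pi_{ii'} = \sum_{i''} \pi_{ii''}\,\pi_{i''i'}$ $\forall i, i'$ (idempotence) 
$\sum_{i'} \pi_{ii'} = 1$ $\forall i$ 
$\sum_{i} \pi_{ii} = k$ 

is a matrix partitioned into k blocks, $\Pi = \mathrm{diag}(\Pi^1, \ldots, \Pi^l, \ldots, \Pi^k)$, with 
$\Pi^l = \frac{1}{n_l} 1_l 1_l^T$, $\mathrm{trace}(\Pi^l) = 1$ $\forall l$ and $\sum_{l=1}^{k} n_l = n$; $1_l$ denotes the vector 
of appropriate dimension with all its values equal to 1. 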
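
The easy direction of this statement can be checked numerically: a block diagonal matrix built from blocks $\frac{1}{n_l} 1_l 1_l^T$ satisfies all four constraints. The following sketch does only that; the function names are hypothetical, and it does not replace the proof of the converse given below.

```python
def block_projection(sizes):
    """Build diag(Pi^1, ..., Pi^k) with Pi^l = (1/n_l) * ones * ones^T."""
    n = sum(sizes)
    pi = [[0.0] * n for _ in range(n)]
    start = 0
    for n_l in sizes:
        for i in range(start, start + n_l):
            for j in range(start, start + n_l):
                pi[i][j] = 1.0 / n_l
        start += n_l
    return pi


def satisfies_constraints(pi, k, tol=1e-9):
    """Check symmetry, idempotence, unit row sums and trace equal to k."""
    n = len(pi)
    idx = range(n)
    square = [[sum(pi[i][m] * pi[m][j] for m in idx) for j in idx] for i in idx]
    symmetric = all(abs(pi[i][j] - pi[j][i]) < tol for i in idx for j in idx)
    idempotent = all(abs(square[i][j] - pi[i][j]) < tol for i in idx for j in idx)
    row_sums = all(abs(sum(row) - 1.0) < tol for row in pi)
    trace = abs(sum(pi[i][i] for i in idx) - k) < tol
    return symmetric and idempotent and row_sums and trace
```

For example, `satisfies_constraints(block_projection([2, 3, 1]), 3)` checks a partition of six points into three clusters.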


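Proof. Since $\Pi$ is idempotent ($\Pi^2 = \Pi$), we have $\forall i$: $\pi_{ii} = \sum_{i'} \pi_{ii'}^2$. From the Lemma 
above, we know that there exists $i^0 \in \{1, 2, \ldots, n\}$ such that $\max_{i,i'} \pi_{ii'} = \pi_{i^0 i^0} > 0$. 
Consider the set $A_{i^0}$ defined by $A_{i^0} = \{i \mid \pi_{i^0 i} > 0\}$. We can rewrite $\forall i \in A_{i^0}$: 
$\pi_{ii} = \sum_{i' \in A_{i^0}} \pi_{ii'}^2$, 

\[ \forall i \in A_{i^0}: \quad \sum_{i'} \pi_{ii'} = \sum_{i' \in A_{i^0}} \pi_{ii'} = 1 \tag{14} \]

and, 

\[ \sum_{i' \in A_{i^0}} \sum_{i \in A_{i^0}} \pi_{ii'} = \sum_{i \in A_{i^0}} \pi_{i\cdot} = \sum_{i \in A_{i^0}} 1 = |A_{i^0}| \tag{15} \]

\[ \pi_{ii} = \sum_{i' \in A_{i^0}} \pi_{ii'}^2 \;\Longrightarrow\; \forall i \in A_{i^0}: \quad \sum_{i' \in A_{i^0}} \frac{\pi_{ii'}^2}{\pi_{ii}} = \sum_{i' \in A_{i^0}} \frac{\pi_{ii'}}{\pi_{ii}}\,\pi_{ii'} = 1. \tag{16} \]

From (14) and (16), we deduce that $\forall i \in A_{i^0}$: $\sum_{i' \in A_{i^0}} \pi_{ii'} = \sum_{i' \in A_{i^0}} \frac{\pi_{ii'}}{\pi_{ii}}\,\pi_{ii'}$, 
implying that $\pi_{ii'} = \pi_{ii}$, $\forall i, i' \in A_{i^0}$. Substituting in (15) $\pi_{ii'}$ by $\pi_{ii}$ for all 
$i, i' \in A_{i^0}$ leads to $\sum_{i' \in A_{i^0}} \pi_{ii'} = \sum_{i' \in A_{i^0}} \pi_{ii} = |A_{i^0}|\,\pi_{ii} = 1$, $\forall i \in A_{i^0}$. From this we 
can deduce that $\pi_{ii} = \pi_{ii'} = \frac{1}{|A_{i^0}|}$ $\forall i, i' \in A_{i^0}$. We can therefore rewrite the matrix 
$\Pi$ in the form of a block diagonal matrix 

\[ \Pi = \begin{pmatrix} \Pi^0 & 0 \\ 0 & \bar{\Pi}^0 \end{pmatrix}, \]

where $\Pi^0$ is a block matrix whose general term is defined by $\pi^0_{ii'} = \frac{1}{|A_{i^0}|}$ $\forall i, i' \in A_{i^0}$ 
and $\mathrm{trace}(\Pi^0) = 1$. 

The matrix $\bar{\Pi}^0$ is a positive semi-definite matrix which also satisfies the constraints 
$(\bar{\Pi}^0)^T = \bar{\Pi}^0$, $\bar{\Pi}^0 1 = 1$, $(\bar{\Pi}^0)^2 = \bar{\Pi}^0$ and $\mathrm{trace}(\bar{\Pi}^0) = k - 1$. 

By repeating the same process k - 1 times, we get the block diagonal form of $\Pi$: 
$\Pi = \mathrm{diag}(\Pi^0, \Pi^1, \ldots, \Pi^l, \ldots, \Pi^{k-1})$ with $\Pi^l = \frac{1}{n_l} 1_l 1_l^T$, $\mathrm{trace}(\Pi^l) = 1$ $\forall l$ and 
$\sum_{l=0}^{k-1} n_l = n$. 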
Clustering Adolescent Female Physical Activity 
Levels with an Infinite Mixture Model on 
Random Effects 


Amy LaLonde, Tanzy Love, Deborah R. Young, and Tongtong Wu 


Abstract Physical activity trajectories from the Trial of Activity in Adolescent Girls 
(TAAG) capture the various exercise habits over female adolescence. Previous analyses 
of this longitudinal data from the University of Maryland field site examined the 
effect of various individual-, social-, and environmental-level factors impacting the 
change in physical activity levels over 14 to 23 years of age. We aimed to understand 
the differences in physical activity levels after controlling for these factors. Using a 
Bayesian linear mixed model incorporating a model-based clustering procedure for 
random deviations that does not specify the number of groups a priori, we find that 
physical activity levels are starkly different for about 5% of the study sample. These 
young girls are exercising on average 23 more minutes per day. 


Keywords: Bayesian methodology, Markov chain Monte Carlo, mixture model, 
reversible jump, split-merge procedures 


1 Introduction 


Physical activity and diet are arguably the two main controllable factors having the 
greatest impact on our health. Whereas we have little to no control over factors like 
our genetic predisposition to disease or exposure to environmental toxins, we have 
much greater control over our diet and activity levels. Despite our ability to choose to 
engage in healthy behaviors such as exercising and eating a healthy diet, these choices 
are plagued with the complexity of human psychology and the modern demands and 
distractions that pervade our lives today. Several factors influence levels of physical 
activity; we explore the factors impacting female adolescents using longitudinal data. 

The University of Maryland, one of the six initial university field centers of the 
Trial of Activity in Adolescent Girls (TAAG), selected to follow its 2006 8th grade 
cohort for two additional time points over adolescence: 11th grade and 23 years of 
age. The females were therefore measured roughly at ages 14, 17, and 23. In these 
waves, there was no intervention as this observational longitudinal study aimed at 
exploring the patterns of physical activity levels and associated factors over time. 

The model presented in Wu et al. [1] motivates the current work. We fit a similar 
linear mixed model controlling for the same variables. Rather than cluster the raw 
physical activity trajectories to identify groups, we cluster the females within the 
model-fitting procedure based on the values of the subject-specific deviations from 
the adjusted physical activity levels. Fitting a Bayesian linear mixed model, we 
simultaneously explore the subject groups through the use of reversible jump Markov 
chain Monte Carlo (MCMC) applied to the random effects. Bayesian model-based 
clustering methods have been applied within linear mixed models to identify groups 
by clustering the fitted values of the dependent variable. For example, [2] fits cluster- 
specific linear mixed models to the gene expression outcome using an EM algorithm 
and [3] clusters gene expression in a similar fashion, except using Bayesian methods. 
In contrast, we perform the clustering on the random effects, which allows us to 
investigate the variability that is unexplained by the covariates of interest. This 
methodology is advantageous because of its ability to jointly estimate all effects, 
while also exploring the infinite space of group arrangements. 


2 Bayesian Mixture Models for Heterogeneity of Random Effects 


Let $y_i = (y_{i1}, \ldots, y_{iT})$ be the $i$th subject's average daily moderate-to-vigorous 
physical activity (MVPA) at each of the T = 3 time points. The MVPA was collected 
from ActiGraph accelerometers (Manufacturing Technologies Inc. Health Systems, 
Model 7164, Shalimar, FL) worn for seven consecutive days. Accelerometers offered 
a great alternative to self-report for tracking physical activity levels, and measuring 
over seven days helped to account for differences in activity patterns during weekdays 
and weekends. Wu et al. [1] analyzed this cohort using mixed models that accounted 
for the subject-specific variability. We let $X_i$ represent the $i$th subject's values for 
covariates. 

Furthermore, let $r = (r_1, \ldots, r_n)$ represent the subject-specific random effects for 
the n subjects. The simple linear mixed model is written in terms of each subject as 

\[ y_i = X_i \beta + r_i 1_T + \epsilon_i, \tag{1} \]

where $\beta$ represents the coefficients for the covariate effects and $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{iT})$ 
are the residuals. We assume independence and normality in the residuals and the 
random effects; hence, $r_i \sim N(0, \sigma_r^2)$ and $\epsilon_i \sim N(0, \sigma_\epsilon^2 I_T)$ for $i = 1, \ldots, n$. 
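
To make the structure of Eq. (1) concrete, the sketch below simulates outcomes from a random-intercept model with normally distributed random effects and residuals. The function name, argument layout and seed handling are assumptions made for this illustration and are not taken from the original analysis.

```python
import random


def simulate_mixed_model(X, beta, sigma_r, sigma_eps, seed=0):
    """Simulate y_i = X_i beta + r_i 1_T + eps_i for each subject.

    X is a list of subjects; each subject is a list of T covariate rows.
    sigma_r and sigma_eps are standard deviations (assumed parameterisation).
    """
    rng = random.Random(seed)
    outcomes, random_effects = [], []
    for X_i in X:
        r_i = rng.gauss(0.0, sigma_r)  # one intercept shift per subject
        y_i = [
            sum(x * b for x, b in zip(row, beta)) + r_i + rng.gauss(0.0, sigma_eps)
            for row in X_i
        ]
        outcomes.append(y_i)
        random_effects.append(r_i)
    return outcomes, random_effects
```

The same subject-level draw `r_i` is added at every time point, which is what induces within-subject correlation in the simulated trajectories.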


Fitting the mixed model demonstrates substantial heterogeneity in the residuals: 
the variability increases as the fitted values increase. A traditional approach to fixing 
this violation would re-fit the model to the log-transformed MVPA values. Plots 
of residuals versus fitted values in this model approach also exhibited evidence of 
heterogeneity in the model; thus, still violating a core assumption of the regression 
framework. Given the changes adolescents experience as they grow into young adults, 
we expect to see heterogeneity in the physical activity patterns across this duration of 
follow-up time. However, the inability of the model to capture such changes over time 
at these higher levels of physical activity suggests the need for model improvements. 
The purpose of this analysis is to present our adjustments to previous analyses in 
order to investigate underlying characteristics across different groups of females 
formed based on deviations from adjusted physical activity levels. 
Fig. 1 The plot on the left depicts the residuals versus fitted values for the linear mixed model 
in Eq. (1); they demonstrate severe heteroscedasticity. The variance increases as the fitted values 
increase. The plot on the right depicts the distribution of the random effects. 


We fit the mixed model in Eq. (1) to the sample of female adolescents. The 
heteroscedasticity depicted in Figure 1 reveals an increase in variance with predicted 
minutes of moderate-to-vigorous physical activity, which we would expect. The plot 
on the right in Figure 1 demonstrates that the distribution of the random effects does 
not appear to follow our assumption of a normal distribution centered around 
zero. The random effects do appear to follow a normal distribution over the lower 
range of deviations with a subset of the subjects having larger positive deviations 
from the estimated adjusted physical activity levels. 

To capture the heterogeneity and allow the random effects to follow a non-normal 
distribution, we assign the random effects a Gaussian mixture distribution. Before 
introducing the model for heterogeneity, we note the likelihood distribution for the 
observed outcomes, $Y = (y_1, \ldots, y_n)'$. The moderate-to-vigorous physical activity 
distribution is 
Then to account for the heterogeneity across subjects, the probability density for 
the subject-specific deviations in physical activity is expressed as a mixture of one- 
dimensional normal densities, 


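\[ p(r_i \mid \mu, \sigma^2) = \sum_{g=1}^{G} \pi_g \,(2\pi\sigma_g^2)^{-1/2} \exp\left[-\frac{1}{2\sigma_g^2}(r_i - \mu_g)^2\right]. \tag{3} \]

Here, $\mu = (\mu_1, \ldots, \mu_G)$ defines the group-specific mean deviations, $\sigma^2 = 
(\sigma_1^2, \ldots, \sigma_G^2)'$ characterizes the variances of the group-specific deviations, and 
$\pi = (\pi_1, \ldots, \pi_G)'$ is the probability of membership in each group g. 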
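
The mixture density in Eq. (3) is straightforward to evaluate. The sketch below is a direct transcription under the assumption that the group parameters are passed as plain lists; the function and argument names are hypothetical.

```python
from math import exp, pi, sqrt


def mixture_density(r, weights, means, variances):
    """Density of a one-dimensional Gaussian mixture evaluated at r.

    weights are the membership probabilities, means the group-specific
    mean deviations and variances the group-specific variances.
    """
    return sum(
        w * exp(-((r - m) ** 2) / (2.0 * v)) / sqrt(2.0 * pi * v)
        for w, m, v in zip(weights, means, variances)
    )
```

With a single group of weight 1, mean 0 and variance 1 this reduces to the standard normal density, which is the assumption the mixture is meant to relax.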


The model in Eqs. (2) and (3) can be fit using either EM or Bayesian MCMC 
procedures. Both require specification of a fixed number of groups, G. While we 
may hypothesize that there are only two groups (one that is normally distributed and 
centered at zero and another that is normally distributed and centered at a larger 
mean), the assumption hinges on what we have seen from plots like those in Figure 
1. The random effects in the aforementioned histogram, however, are being shrunk 
towards zero by assumption, whereas a mixture model will allow the data to more 
accurately depict the deviations observed in the girls' physical activity levels. The 
assumption of G groups can strongly influence the results of our model fitting. To 
circumvent the issues associated with selecting G in either an EM algorithm or a 
Bayesian finite mixture model framework, we implement a Bayesian mixture model 
that incorporates G as an additional unknown parameter. 


2.1 Bayesian Mixed Models With Clustering 


Richardson and Green [4] adapt the reversible jump methodology to univariate normal 
mixture models. In addition to being able to characterize the distribution of G, 
this Bayesian framework has the ability to simultaneously explore the posterior dis- 
tribution for the covariate effects of interest. Furthermore, we will have the posterior 
distributions of the group-defining parameters rather than just point estimates. Since 
we are interested in the physical activity differences in subjects when controlling for 
these covariates, we use Eq. (1) as the basis of our model. 

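The foundation of our clustering model is a finite mixture model on the random 
effects, $r_i$, as shown in Eq. (3). For all $i = 1, \ldots, n$ and $g = 1, \ldots, G$, 

$r_i \mid c_i, \mu, \sigma^2 \sim N(\mu_{c_i}, \sigma^2_{c_i})$, 
$(c_i = g) \mid \pi, G \sim \mathrm{Categorical}(\pi_1, \ldots, \pi_G)$, 
$\mu_g \mid \mu_0, \tau \sim N(\mu_0, \tau)$, 
$\sigma_g^2 \mid c, \delta \sim IG(c, \delta)$, 
$\pi \mid G \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha)$, 
$G \sim \mathrm{Uniform}[1, G_{\max}]$, 

where $c_i$ is the latent grouping variable tracking the assignment of $r_i$ into any one 
of the G clusters. The likelihood function for these subject-specific deviations, given 
the group assignment, $c_i$, is simply 

\[ p(r_i \mid c_i = g, \mu_g, \sigma_g^2) = (2\pi\sigma_g^2)^{-1/2} \exp\left[-\frac{1}{2\sigma_g^2}(r_i - \mu_g)^2\right]. \]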
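
One way to read the hierarchy above is as a generative recipe. The sketch below draws one realisation from it using only the standard library; the function name is hypothetical, the second argument of the normal prior is treated as a variance, the Dirichlet concentration is passed as `alpha`, and the means are sorted to mimic an ordering constraint. These are assumptions made for this illustration, not a description of the authors' sampler.

```python
import random


def draw_from_prior(n, g_max, mu0, tau, c, delta, alpha, seed=0):
    """Draw (G, weights, means, variances, labels, r) from the hierarchy."""
    rng = random.Random(seed)
    G = rng.randint(1, g_max)  # discrete uniform on {1, ..., g_max}
    # Symmetric Dirichlet via normalised Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(G)]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # Group means, sorted so that mu_1 < ... < mu_G.
    means = sorted(rng.gauss(mu0, tau ** 0.5) for _ in range(G))
    # Inverse gamma(c, delta): reciprocal of a Gamma with shape c, rate delta.
    variances = [1.0 / rng.gammavariate(c, 1.0 / delta) for _ in range(G)]
    labels = rng.choices(range(G), weights=weights, k=n)
    r = [rng.gauss(means[g], variances[g] ** 0.5) for g in labels]
    return G, weights, means, variances, labels, r
```

Note that `random.gammavariate` takes a scale parameter, so the rate `delta` enters as `1.0 / delta`.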
This replaces the typical independent and identically distributed assumption of 
$r_i \sim N(0, \sigma_r^2)$ for all i with a normal distribution that is now conditional on group 
assignment. The remainder of the model formulation closely follows the framework 
constructed in [4], except we have an additional layer of unknown parameters 
defining the linear mixed model in Eq. (1). 

We select conjugate priors so that the posterior distributions of the unknown 
parameters are analytically tractable. The prior on the mixing probabilities, $\pi$, is a 
symmetric Dirichlet distribution, reflecting the prior belief that belonging to any one 
cluster is equally likely. To use the sampling methods of [4], we select a discrete 
uniform prior on G that reflects our uncertainty on the number of groups, and impose 
an a priori ordering of the $\mu_g$, such that for any given value G, $\mu_1 < \mu_2 < \cdots < \mu_G$, 
to remove label switching. Thus, the priors for the clustering parameters are 

\[ p(\mu) = G! \prod_{g=1}^{G} (2\pi\tau)^{-1/2} \exp\left[-\frac{1}{2\tau}(\mu_g - \mu_0)^2\right], \]

\[ p(\sigma^2) = \prod_{g=1}^{G} \frac{\delta^{c}}{\Gamma(c)} (\sigma_g^2)^{-c-1} \exp\left(-\frac{\delta}{\sigma_g^2}\right), \]

\[ p(G) = \frac{1}{G_{\max}} 1\{G \in [1, G_{\max}]\}, \]

where $G_{\max}$ is set to be reasonably large and $1\{G \in [1, G_{\max}]\}$ is a discrete 
indicator function, equal to 1 on the interval $[1, G_{\max}]$ and 0 elsewhere. 

The capacity of our sampler to move between dimensions is essential to our 
ability to explore the grouping of the observations while simultaneously exploring 
the parameters describing the relationships between the covariates and the outcome. 
This means that we can allow the number of components of our mixture model on the 
random effects to increase or decrease at each state of our MCMC chain. Such changes 
impact the dimension of the parameters of the mixture model, $\theta = (\mu, \sigma^2, G, \pi, c)$. 

Let $\theta$ denote the current state of the parameters $(\mu, \sigma^2, G, \pi, c)$ when proposing 
move m, where $m \in \{S, M, B, D\}$ corresponds to a split, merge, birth and 
death, respectively. Given the current state, $\theta$, and move m, we propose a new 
state, $\theta^m$, under move m. The acceptance probability is written as 

\[ acc_m(\theta^m, \theta) = \min\left\{1, \frac{p(\theta^m \mid r)\, q(\theta^m \mid m')}{p(\theta \mid r)\, q(\theta \mid m)}\, |J| \right\}, \]

where $p(\cdot)$ and $q(\cdot)$ denote the target and proposal distribution, respectively. In 
our case, the target distribution is the posterior distribution of our group-specific 
parameters, $(\mu, \sigma^2, \pi, c)$, given the data, r, which are the random effects. Each 
proposed move changes the dimension of the parameters in $\theta$ by 1, adding or deleting 
group-specific parameters. The ratio $q(\theta^m \mid m')/q(\theta \mid m)$ ensures "dimension 
balancing", as explained in [4]. For moves increasing in dimension, the Jacobian, $|J|$, 
is computed as $|\partial\theta^m / \partial(\theta, u)|$ because moving from $\theta$ to $\theta^m$ will require additional 
parameters, u, to appropriately match dimensions. The opposite is true for moves 
decreasing in dimension. This is what we refer to as the reversible jump mechanism; 
each time a split is proposed, we must also design the reversible move that would 
result in the currently merged component, and vice versa. 
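
In practice, an acceptance probability of this form is usually evaluated on the log scale to avoid numerical underflow. The sketch below assumes the caller supplies the log target densities, the log proposal terms appearing in the numerator and denominator, and the log absolute Jacobian; the function and argument names are hypothetical, and the code does not construct the split or merge proposals themselves.

```python
from math import exp


def acceptance_probability(log_target_new, log_target_cur,
                           log_q_num, log_q_den, log_abs_jacobian):
    """min{1, [p(new) q_num] / [p(cur) q_den] * |J|}, computed from logs."""
    log_ratio = (
        (log_target_new - log_target_cur)
        + (log_q_num - log_q_den)
        + log_abs_jacobian
    )
    # exp is only called for non-positive arguments, so it cannot overflow.
    return 1.0 if log_ratio >= 0.0 else exp(log_ratio)
```

A proposed move would then be accepted when a uniform draw on (0, 1) falls below the returned value.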
Split and merge moves are implemented for our model. These moves update $\pi$, 
$\mu$, and c for two adjacent groups or create two adjacent groups using three 
Beta-distributed additional parameters, u, for dimension balancing in a similar way to 
[4]. Within our context of random effects, births and deaths are not appropriate: 
a singleton causes issues of identifiability because the $r_i$ is no longer defined as 
random. We therefore do not allow for birth and death moves in our reversible jump methods. 


3 Trial of Activity in Adolescent Girls (TAAG) and Model Results 


Our analysis focuses only on those girls from the University of Maryland site of the 
TAAG study who were measured at all three follow-up time points, beginning in 
2006. After excluding girls with missing outcomes, the final sample consisted of 428 
girls measured in 2006, 2009, and 2014. Missing covariate values were imputed for 
four subjects using the values from the nearest time point. 

We determine the group assignments using an MCMC sampler having 10,000 
iterations, with a burn-in of 500 draws. The posterior distribution for G was extremely 
peaked at G = 2. Summarization of the posterior distribution of the group 
assignments via the least squares clustering method delivers the final arrangement, 
the $\hat{c}_i$'s, of girls into two groups describing their physical activity levels [5]. Since our 
sampler explores several models for which group assignments and G can vary, we 
sample additional draws from the posterior distribution of the remaining parameters 
of interest using an MCMC sampler with the model specification of Eq. (1) with groups 
fixed at our posterior assignment, the $\hat{c}_i$'s, for the subject-specific random effects. This 
additional chain was run for 10,000 iterations with a burn-in of 500 draws, yielding the 
results summarized below. Convergence diagnostics indicated that 10,000 iterations 
sufficiently met the effective sample size threshold for estimating the coefficients for 
the covariate effects, $\beta$, and the group-specific means, $\mu$, describing the deviations 
of the girls’ physical activity levels [6]. 
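
A common form of least squares clustering selects, among the sampled partitions, the one whose co-membership matrix is closest in squared distance to the posterior pairwise co-clustering probabilities. The sketch below implements that idea under the assumption that each MCMC draw is a list of group labels; it is an illustration of the general technique and not necessarily the exact procedure of [5].

```python
def least_squares_clustering(draws):
    """Return the sampled partition closest to the average co-clustering matrix."""
    n = len(draws[0])
    m = len(draws)
    # Estimated probability that subjects i and j share a group.
    prob = [
        [sum(d[i] == d[j] for d in draws) / m for j in range(n)]
        for i in range(n)
    ]

    def loss(d):
        # Squared distance between the draw's 0/1 co-membership and prob.
        return sum(
            ((d[i] == d[j]) - prob[i][j]) ** 2
            for i in range(n)
            for j in range(n)
        )

    return min(draws, key=loss)
```

Because the loss depends only on which subjects are grouped together, the result is unaffected by how the groups are numbered within each draw.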

After controlling for covariates believed to best describe the variation in the 
physical activity levels of females, our method finds that there is a small subset of the 
females who are much more active than the remainder of the sample. Every subject 
in the more active group has fitted trajectories above the recommended 30 minutes 
of exercise. Most of the population does not get the recommended allowance of daily 
physical activity and this is well-supported in our analysis. All but two subjects in the 
less active group have fitted trajectories that never pass the recommended 30 minutes 
of exercise. The random effects from this model better fit a normal distribution (not 
centered at 0) for each of the two groups and do not show as much heteroscedasticity 
over time as the one group model depicted in Figure 1. 

Given these differences are observed even after controlling for the aforementioned 
variables, we would like to further examine the characteristics that may set these 
highly active females apart from the rest of the girls in our sample. To do this, we 
look at a number of other covariates that were either excluded during the variable 
selection process or were not measured at all time points. We use simple Wilcoxon 
tests on the available time points of the additional variables and on all time points 
for covariates we adjusted for in the initial model. 
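
For readers who want to reproduce this kind of two-group comparison, the sketch below implements a Wilcoxon rank-sum test with midranks and a normal approximation. It assumes the rank-sum (two-sample) variant is the one intended, ignores the tie correction in the variance, and uses a hypothetical function name; an established statistics package would normally be preferred.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (U statistic for x, z score, approximate p-value).
    """
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # midrank for tied values
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u = w - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return u, z, p
```

With group sizes as unbalanced as those in this study, the normal approximation is only a rough guide.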

We first note that the median BMI of the subset of highly active girls is sig- 
nificantly lower than that of the remaining girls consistently at each TAAG wave. 
Similarly, mother's education level is also consistently significant at each time point. 
These values are measured at each time point to reflect changes as the mother pursues 
additional education, or as the girls become more aware of their mother's education. 
The majority of the highly active girls have mothers who have completed college 
or higher (75% or higher at each time point), whereas the remainder of the sample 
has mothers with a range of education levels (less than high school through college 
or more). The number of parks within a one-mile radius of the home is significantly 
different among the high and low groups in the middle school and high school years, 
when the girls are likely to be living at home. This variable may be an indicator of so- 
cioeconomic status as families with more money may live in neighborhoods nearer to 
parks. Finally, in the high school and college-aged years, the self-management strate- 
gies among the highly active girls are significantly higher rated than the remainder 
of the population. 

In high school, the subset of highly active girls tends to have better self-described 
health, participate in more sports teams, have access to more physical education 
classes, and have been older at the time of their first menstrual period. At the college 
age, these girls still have higher self-described health; in addition, the global physical 
activity score and self-esteem scores are now significantly higher in the subset of 
highly active females. 


4 Discussion 


We extended the mixed models of [1] with the application still focused on the same 
428 girls from the TAAG, TAAG 2, and TAAG 3 studies. Within the Bayesian 
linear mixed model, we implemented a clustering procedure aimed at clustering 
girls into groups based on deviations from the adjusted physical activity levels. 
These groups reflected the tendency for small subsets of females to be highly active. 
Not surprisingly, only 24 girls (5% of our sample) were classified as highly active. 

This group of highly active girls differs in several ways. These girls are more 
active, and thus we expect that the age at first menstrual period will be higher. We 
may also expect that the highly active girls are involved in more sports teams and 
that they will have higher global physical activity scores. Other interesting 
characteristics of these girls, however, are their increased self-management strategies, 
self-esteem scores, and self-described health. This may suggest that interventions 
focusing on time management and emphasizing self-efficacy could impact adolescent 
female physical activity levels. In doing so, we could aim to increase self-esteem and 
self-described health. 

The ability to account for heterogeneity in the subject-specific deviations from 
an adjusted model allows us to keep the outcome on the original scale while still 
improving model assumptions. Our model estimates model parameters while 
identifying groups of observations with differing activity levels. In contrast, a frequentist 
approach could be taken using an EM algorithm; however, we would lose the ability 
for the data to give statistical inference on the appropriate number of groups and to 
incorporate posterior samples with different numbers of groups into the estimated 
class label. 

The current analysis looks only at identifying groups based on deviations from 
the overall adjusted minutes of MVPA for the females. A natural extension would 
be to look at clustering on the slope for time to begin to understand the various 
patterns we observe among adolescent females over time. Furthermore, we may 
want to incorporate a variable selection procedure into the fixed portion of the 
model. The groups we find by either clustering on subject-specific intercepts and/or 
slopes would be sensitive to the covariates selected, depending on the variability 
captured by this fixed portion of the model. Physical activity, like most human 
behavior, varies widely for a multitude of reasons, many of which we may not think 
to or are unable to measure. Identifying groups when a traditional mixed model 
constructed using standard variable selection methods suggests lack of fit can be a 
useful step towards better understanding differences through post-hoc analyses of 
the groups' characteristics. 
Unsupervised Classification of Categorical Time 
Series Through Innovative Distances 


Angel López-Oriona, José A. Vilar, and Pierpaolo D’Urso 


Abstract In this paper, two novel distances for nominal time series are introduced. 
Both of them are based on features describing the serial dependence patterns between 
each pair of categories. The first dissimilarity employs the so-called association 
measures, whereas the second computes correlation quantities between indicator 
processes whose uniqueness is guaranteed from standard stationary conditions. The 
metrics are used to construct crisp algorithms for clustering categorical series. The 
approaches are able to group series generated from similar underlying stochastic 
processes, achieve accurate results with series coming from a broad range of mod- 
els and are computationally efficient. An extensive simulation study shows that the 
devised clustering algorithms outperform several alternative procedures proposed in 
the literature. Specifically, they achieve better results than approaches based on max- 
imum likelihood estimation, which take advantage of knowing the real underlying 
procedures. Both innovative dissimilarities could be useful for practitioners in the 
field of time series clustering. 


Keywords: categorical time series, clustering, association measures, indicator pro- 
cesses 


1 Introduction 


Clustering of time series concerns the challenge of splitting a set of unlabeled time 
series into homogeneous groups, which is a pivotal problem in many knowledge 
discovery tasks [1]. Categorical time series (CTS) are a particular class of time 
series exhibiting a qualitative range which consists of a finite number of categories. 
Most of the classical statistical tools used for real-valued time series (e.g., the 
autocorrelation function) are not useful in the categorical case, so different types 
of measures than the standard ones are needed for a proper analysis of CTS. CTS 
arise in an extensive assortment of fields [2, 3, 7, 8, 9]. Since only a few works have 
addressed the problem of CTS clustering [4, 5], the main goal of this paper is to 
introduce novel clustering algorithms for CTS. 


2 Two Novel Feature-based Approaches for Categorical Time 
Series Clustering 


Consider a set of s categorical time series $S = \{X_t^{(1)}, \ldots, X_t^{(s)}\}$, where the j-th 
element $X_t^{(j)}$ is a $T_j$-length partial realization from any categorical stochastic process 
$(X_t)_{t \in \mathbb{Z}}$ taking values on a number r of unordered qualitative categories, which are 
coded from 1 to r so that the range of the process can be seen as $\mathcal{V} = \{1, \ldots, r\}$. 
We suppose that the process $(X_t)_{t \in \mathbb{Z}}$ is bivariate stationary, i.e., the pairwise joint 
distribution of $(X_{t-k}, X_t)$ is invariant in t. Our goal is to perform clustering on the 
elements of S in such a way that the series assumed to be generated from identical 
stochastic processes are placed together. To that aim, we propose two distance metrics 
which are based on feature extraction. 


2.1 Descriptive Features for Categorical Processes 


Let $\{X_t, t \in \mathbb{Z}\}$ be a bivariate stationary categorical stochastic process with range 
$\mathcal{V} = \{1, \ldots, r\}$. Denote by $\pi = (\pi_1, \ldots, \pi_r)$ the marginal distribution of $X_t$, that 
is, $P(X_t = j) = \pi_j > 0$, $j = 1, \ldots, r$. For a fixed $l \in \mathbb{N}$, we use the notation $p_{ij}(l) = 
P(X_t = i, X_{t-l} = j)$, with $i, j \in \mathcal{V}$, for the lagged bivariate probability and the 
notation $p_{i|j}(l) = P(X_t = i \mid X_{t-l} = j) = p_{ij}(l)/\pi_j$ for the conditional bivariate 
probability. 

To extract suitable features characterizing the serial dependence of a given CTS, 
we start by defining the concepts of perfect serial independence and dependence 
for a categorical process. We have perfect serial independence at lag $l \in \mathbb{N}$ if and 
only if $p_{ij}(l) = \pi_i \pi_j$ for any $i, j \in \mathcal{V}$. On the other hand, we have perfect serial 
dependence at lag $l \in \mathbb{N}$ if and only if the conditional distribution $p_{\cdot|j}(l)$ is a 
one-point distribution for any $j \in \mathcal{V}$. There are several association measures which 
describe the serial dependence structure of a categorical process at lag l. One 
such measure is the so-called Cramér's v, which is defined as 

\[ v(l) = \sqrt{\frac{1}{r-1} \sum_{i,j=1}^{r} \frac{(p_{ij}(l) - \pi_i \pi_j)^2}{\pi_i \pi_j}}. \tag{1} \]



Cramér's v summarizes the serial dependence patterns of a categorical process for 
every pair (i, j) and $l \in \mathbb{N}$. However, this quantity is not appropriate for characterizing 
a given stochastic process, since two different processes can have the same value 
of v(l). A better way to characterize the process $X_t$ is by considering the matrix 
$V(l) = (V_{ij}(l))_{1 \le i,j \le r}$, where $V_{ij}(l) = \frac{(p_{ij}(l) - \pi_i \pi_j)^2}{\pi_i \pi_j}$. The elements of the matrix 
$V(l)$ give information about the so-called unsigned dependence of the process. 
However, it is often useful to know whether a process tends to stay in the state it has 
reached or, on the contrary, the repetition of the same state after l steps is infrequent. 
This motivates the concept of signed dependence, which arises as an analogy of the 
autocorrelation function of a numerical process, since such quantity can take either 
positive or negative values. Provided that perfect serial dependence holds, we have 
perfect positive (negative) serial dependence if $p_{i|i}(l) = 1$ ($p_{i|i}(l) = 0$) for all $i \in \mathcal{V}$. 

Since $V(l)$ does not shed light on the signed dependence structure, it would 
be valuable to complement the information contained in $V(l)$ by adding features 
describing signed dependence. In this regard, a common measure of signed serial 
dependence at lag l is Cohen's $\kappa$, which takes the form 

\[ \kappa(l) = \frac{\sum_{j=1}^{r} \left(p_{jj}(l) - \pi_j^2\right)}{1 - \sum_{j=1}^{r} \pi_j^2}. \tag{2} \]

Proceeding as with v(l), the quantity $\kappa(l)$ can be decomposed in order to obtain 
a complete representation of the signed dependence pattern of the process. In this 
way, we consider the vector $\mathcal{K}(l) = (\mathcal{K}_1(l), \ldots, \mathcal{K}_r(l))$, where each $\mathcal{K}_i(l)$ is defined 
as 

\[ \mathcal{K}_i(l) = \frac{p_{ii}(l) - \pi_i^2}{1 - \sum_{j=1}^{r} \pi_j^2}. \tag{3} \]

In practice, the matrix $V(l)$ and the vector $\mathcal{K}(l)$ must be estimated from a T-length 
realization of the process, $\{X_1, \ldots, X_T\}$. To this aim, we consider estimators of $\pi_i$ 
and $p_{ij}(l)$, denoted by $\hat{\pi}_i$ and $\hat{p}_{ij}(l)$, respectively, defined as $\hat{\pi}_i = N_i / T$ and 
$\hat{p}_{ij}(l) = N_{ij}(l) / (T - l)$, where $N_i$ is the number of variables $X_t$ equal to i in the 
realization $\{X_1, \ldots, X_T\}$, and $N_{ij}(l)$ is the number of pairs $(X_t, X_{t-l}) = (i, j)$ in the 
realization $\{X_1, \ldots, X_T\}$. Hence, estimates of $V(l)$ and $\mathcal{K}(l)$, denoted by $\hat{V}(l)$ and 
$\hat{\mathcal{K}}(l)$, respectively, can be obtained by plugging in the estimates $\hat{\pi}_i$ and $\hat{p}_{ij}(l)$ in (2) 
and (3), respectively. This leads directly to estimates of v(l) and $\kappa(l)$, denoted by 
$\hat{v}(l)$ and $\hat{\kappa}(l)$. 
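
The plug-in estimation described above can be sketched as follows. The code assumes the series is a list of integers coded from 1 to r, that every category occurs at least once, and that the lagged counts are divided by T - l; the function name and return layout are hypothetical.

```python
from math import sqrt


def estimate_features(series, r, lag):
    """Plug-in estimates of the marginal and lagged probabilities and of
    V(l), K(l), Cramer's v and Cohen's kappa for one categorical series."""
    T = len(series)
    pi_hat = [sum(1 for x in series if x == i) / T for i in range(1, r + 1)]
    # counts[i][j]: number of pairs (X_t, X_{t-lag}) equal to (i+1, j+1).
    counts = [[0] * r for _ in range(r)]
    for t in range(lag, T):
        counts[series[t] - 1][series[t - lag] - 1] += 1
    p_hat = [[c / (T - lag) for c in row] for row in counts]
    V = [
        [(p_hat[i][j] - pi_hat[i] * pi_hat[j]) ** 2 / (pi_hat[i] * pi_hat[j])
         for j in range(r)]
        for i in range(r)
    ]
    denom = 1.0 - sum(p * p for p in pi_hat)
    K = [(p_hat[i][i] - pi_hat[i] ** 2) / denom for i in range(r)]
    v = sqrt(sum(map(sum, V)) / (r - 1))  # Cramer's v from the entries of V
    kappa = sum(K)                        # Cohen's kappa as the sum of K_i
    return pi_hat, p_hat, V, K, v, kappa
```

For a strictly alternating two-category series, the estimated kappa is close to -1 at odd lags and close to 1 at even lags, which matches the intuition behind signed dependence.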

An alternative way of describing the dependence structure of the process 
$\{X_t, t \in \mathbb{Z}\}$ is to take into consideration its equivalent representation as a 
multivariate binary process. The so-called binarization of $\{X_t, t \in \mathbb{Z}\}$ is constructed as 
follows. Let $e_1, \ldots, e_r \in \{0, 1\}^r$ be unit vectors such that $e_k$ has all its entries 
equal to zero except for a one in the k-th position, $k = 1, \ldots, r$. Then, the binary 
representation of $\{X_t, t \in \mathbb{Z}\}$ is given by the process $\{Y_t = (Y_{t,1}, \ldots, Y_{t,r})^\top, t \in \mathbb{Z}\}$ 
such that $Y_t = e_j$ if $X_t = j$. For fixed $l \in \mathbb{N}$ and $i, j \in \mathcal{V}$, consider the correlation 
$\phi_{ij}(l) = \mathrm{Corr}(Y_{t,i}, Y_{t-l,j})$, which measures linear dependence between the i-th and 
j-th categories with respect to the lag l. The following proposition provides some 
properties of the quantity $\phi_{ij}(l)$. 
Proposition 1 

Let $\{X_t, t \in \mathbb{Z}\}$ be a bivariate stationary categorical process with range $\mathcal{V} = 
\{1, \ldots, r\}$. Then the following properties hold: 

1. For every $i, j \in \mathcal{V}$, the function $\phi_{ij} : \mathbb{N} \to [-1, 1]$ given by $l \mapsto \phi_{ij}(l) = 
\mathrm{Corr}(Y_{t,i}, Y_{t-l,j})$ is well-defined. 
2. $\phi_{ij}(l) = 0 \Leftrightarrow p_{ij}(l) = \pi_i \pi_j$. 
3. $\phi_{ij}(l) = \pm 1 \Leftrightarrow p_{ij}(l) = \pm\sqrt{\pi_i (1 - \pi_i) \pi_j (1 - \pi_j)} + \pi_i \pi_j$. 
4. $\phi_{ii}(l) = 1 \Leftrightarrow p_{i|i}(l) = 1$. 

The proof of Proposition 1 is quite straightforward and is omitted from the
manuscript for the sake of brevity. According to Proposition 1, the quantity $\phi_{ij}(l)$
can be used to explain both types of dependence, signed and unsigned, within the
underlying process. In fact, in the case of perfect unsigned independence at lag $l$,
we have that $p_{ij}(l) = \pi_i \pi_j$ for all $i, j \in \mathcal{V}$, so that $\phi_{ij}(l) = 0$
for all $i, j \in \mathcal{V}$ in accordance with Property 2 of Proposition 1. Under perfect
positive dependence at lag $l$, $p_{i|i}(l) = 1$ for all $i \in \mathcal{V}$. Then
$\phi_{ii}(l) = 1$ for all $i \in \mathcal{V}$ by Property 4 of Proposition 1. The same property
allows us to conclude that $\phi_{ii}(l) = -\pi_i / (1 - \pi_i)$ for all $i \in \mathcal{V}$ in
the case of perfect negative dependence. In sum, $\phi_{ij}(l)$ evaluates unsigned
dependence when $i \neq j$ and signed dependence when $i = j$. The previous quantities
can be encapsulated in a matrix $\Phi(l) = (\phi_{ij}(l))_{1 \le i,j \le r}$, which can be directly
estimated by means of $\hat{\Phi}(l) = (\hat{\phi}_{ij}(l))_{1 \le i,j \le r}$, where each
$\hat{\phi}_{ij}(l)$ is computed as


$$\hat{\phi}_{ij}(l) = \frac{\hat{p}_{ij}(l) - \hat{\pi}_i \hat{\pi}_j}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)\, \hat{\pi}_j (1 - \hat{\pi}_j)}}$$

(this is derived from the proof of Proposition 1).
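
As an illustration of the estimator above, the following minimal sketch computes a matrix of lagged binarization correlations in plain Python. It assumes that categories are coded as integers 1, ..., r and that the probabilities are estimated by relative frequencies, with the lagged joint frequencies divided by T - l; the function name `phi_hat` and both of these choices are assumptions made for the example, not details taken from the paper.

```python
from math import sqrt

def phi_hat(x, r, lag):
    """Estimate the r x r matrix of lag-`lag` binarization correlations.

    x   : sequence of categories coded 1..r (assumed coding)
    r   : number of categories
    lag : positive integer lag l
    """
    T = len(x)
    # Marginal relative frequencies, estimating pi_i.
    pi = [sum(1 for v in x if v == i) / T for i in range(1, r + 1)]
    # Lagged joint relative frequencies, estimating
    # p_ij(l) = P(X_t = i, X_{t-l} = j); normalising by T - lag is an assumption.
    p = [[0.0] * r for _ in range(r)]
    for t in range(lag, T):
        p[x[t] - 1][x[t - lag] - 1] += 1.0 / (T - lag)
    # Correlation between the indicators Y_{t,i} and Y_{t-l,j}.
    # Requires 0 < pi_i < 1 for every category, otherwise the denominator is zero.
    return [[(p[i][j] - pi[i] * pi[j])
             / sqrt(pi[i] * (1 - pi[i]) * pi[j] * (1 - pi[j]))
             for j in range(r)] for i in range(r)]

# A strictly alternating two-category series: perfect negative dependence
# at lag 1 and perfect positive dependence at lag 2.
x = [1, 2] * 50
phi1 = phi_hat(x, r=2, lag=1)
phi2 = phi_hat(x, r=2, lag=2)
```

For the alternating series, the diagonal entry at lag 1 equals $-\pi_1/(1-\pi_1) = -1$ and the diagonal entry at lag 2 equals 1, in line with the discussion of signed dependence above.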


2.2 Two Innovative Dissimilarities Between CTS 


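In this section we introduce two distance measures between categorical series based
on the features described above. Suppose we have a pair of CTS $X_t^{(1)}$ and $X_t^{(2)}$, and
consider a set of $L$ lags, $\mathcal{L} = \{l_1, \ldots, l_L\}$. A dissimilarity based on
Cramer's $v$ and Cohen's $\kappa$, so-called $d_{CC}$, is defined as


$$d_{CC}(X_t^{(1)}, X_t^{(2)}) = \sum_{k=1}^{L} \Big[ \big\| \mathrm{vec}\big(\hat{V}(l_k)^{(1)} - \hat{V}(l_k)^{(2)}\big) \big\|^2 + \big\| \hat{\mathcal{K}}(l_k)^{(1)} - \hat{\mathcal{K}}(l_k)^{(2)} \big\|^2 \Big] + \big\| \hat{\pi}^{(1)} - \hat{\pi}^{(2)} \big\|^2,$$


where the superscripts (1) and (2) are used to indicate that the corresponding
estimates are obtained with respect to the realizations $X_t^{(1)}$ and $X_t^{(2)}$, respectively.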

An alternative distance measure relying on the binarization of the processes,
so-called $d_B$, is defined as


$$d_B(X_t^{(1)}, X_t^{(2)}) = \sum_{k=1}^{L} \big\| \mathrm{vec}\big(\hat{\Phi}(l_k)^{(1)} - \hat{\Phi}(l_k)^{(2)}\big) \big\|^2 + \big\| \hat{\pi}^{(1)} - \hat{\pi}^{(2)} \big\|^2.$$


For a given set of categorical series, the distances $d_{CC}$ and $d_B$ can be used
as input for traditional clustering algorithms. In this manuscript we consider the
Partitioning Around Medoids (PAM) algorithm.
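
The binarization-based distance can be sketched as follows. The sketch assumes that $d_B$ is the sum over the chosen lags of the squared Euclidean distances between the vectorized estimated correlation matrices, plus the squared distance between the estimated marginal distributions. The helper names (`marginals`, `phi_hat`, `d_B`), the integer coding 1, ..., r and the relative-frequency estimators are assumptions made for illustration.

```python
from math import sqrt

def marginals(x, r):
    """Relative frequencies of the categories 1..r."""
    return [sum(1 for v in x if v == i) / len(x) for i in range(1, r + 1)]

def phi_hat(x, r, lag):
    """Matrix of estimated correlations between lagged binary indicators."""
    T, pi = len(x), marginals(x, r)
    p = [[0.0] * r for _ in range(r)]
    for t in range(lag, T):
        p[x[t] - 1][x[t - lag] - 1] += 1.0 / (T - lag)
    return [[(p[i][j] - pi[i] * pi[j])
             / sqrt(pi[i] * (1 - pi[i]) * pi[j] * (1 - pi[j]))
             for j in range(r)] for i in range(r)]

def d_B(x1, x2, r, lags):
    """Assumed form of d_B: lag-wise squared differences plus marginal term."""
    total = 0.0
    for l in lags:
        A, B = phi_hat(x1, r, l), phi_hat(x2, r, l)
        total += sum((A[i][j] - B[i][j]) ** 2
                     for i in range(r) for j in range(r))
    pi1, pi2 = marginals(x1, r), marginals(x2, r)
    return total + sum((a - b) ** 2 for a, b in zip(pi1, pi2))

x1 = [1, 2] * 50          # alternates every step
x2 = [1, 1, 2, 2] * 25    # alternates every two steps
dist = d_B(x1, x2, r=2, lags=[1])
```

A pairwise matrix of such distances over a collection of series is the kind of input a medoid-based algorithm such as PAM would take.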


3 Partitioning Around Medoids Clustering of CTS 


In this section we examine the performance of both metrics dcc and dg in the 
context of hard clustering (i.e., each series is assigned to exactly one cluster) of CTS 
through a simulation study. 


3.1 Experimental Design 


The simulated scenarios encompass a broad variety of generating processes. In par- 
ticular, three setups were considered, namely clustering of (i) Markov Chains (MC), 
(ii) Hidden Markov Models (HMM) and (iii) New Discrete ARMA (NDARMA) pro- 
cesses. The generating models with respect to each class of processes are given below. 


Scenario 1. Clustering of MC. Consider four three-state MC, so-called $MC_1$, $MC_2$,
$MC_3$ and $MC_4$, with respective transition matrices $P_1^1$, $P_2^1$, $P_3^1$ and $P_4^1$ given by


$P_1^1 = \mathrm{Mat}^3(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2)$,

$P_2^1 = \mathrm{Mat}^3(0.1, 0.8, 0.1, 0.6, 0.3, 0.1, 0.6, 0.2, 0.2)$,

$P_3^1 = \mathrm{Mat}^3(0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05)$,

$P_4^1 = \mathrm{Mat}^3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3)$,


where the operator $\mathrm{Mat}^k$, $k \in \mathbb{N}$, transforms a vector into a square matrix
of order $k$ by sequentially placing the corresponding numbers by rows.
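
A minimal sketch of the row-wise matrix operator and of a Markov chain simulator is given below. The function names `mat` and `simulate_mc`, the coding of states as 1, ..., k and the uniform choice of the initial state are assumptions for illustration; the paper does not state how the chains were initialised.

```python
import random

def mat(values, k):
    """Mat^k: fill a k x k matrix row by row from a flat sequence."""
    if len(values) != k * k:
        raise ValueError("expected k*k values")
    return [list(values[i * k:(i + 1) * k]) for i in range(k)]

def simulate_mc(P, T, rng):
    """Simulate T steps of a Markov chain with states 1..k and transition matrix P."""
    k = len(P)
    states = list(range(1, k + 1))
    x = [rng.choice(states)]  # assumed: uniformly drawn initial state
    for _ in range(T - 1):
        # Row x[-1] - 1 of P holds the transition probabilities out of the current state.
        x.append(rng.choices(states, weights=P[x[-1] - 1])[0])
    return x

# Third transition matrix of Scenario 1.
P3 = mat([0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05], 3)
series = simulate_mc(P3, 200, random.Random(1))
```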


Scenario 2. Clustering of HMM. Consider the bivariate process $(X_t, Q_t)_{t \in \mathbb{Z}}$, where
$Q_t$ stands for the hidden states and $X_t$ for the observable random variables. Process
$(Q_t)_{t \in \mathbb{Z}}$ constitutes a homogeneous MC. Both $(X_t)_{t \in \mathbb{Z}}$ and
$(Q_t)_{t \in \mathbb{Z}}$ are assumed to be count processes with range $\{1, \ldots, r\}$. Process
$(X_t, Q_t)_{t \in \mathbb{Z}}$ is assumed to verify the three classical assumptions of a HMM.
Based on previous considerations, let $HMM_1$, $HMM_2$, $HMM_3$ and $HMM_4$ be four three-state
HMM with respective transition matrices $P_1^2$, $P_2^2$, $P_3^2$ and $P_4^2$ and emission
matrices $E_1^2$, $E_2^2$, $E_3^2$ and $E_4^2$ given by

$P_1^2 = \mathrm{Mat}^3(0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05)$, $P_2^2 = P_1^2$,

$P_3^2 = \mathrm{Mat}^3(0.1, 0.7, 0.2, 0.4, 0.4, 0.2, 0.4, 0.3, 0.3)$,

$P_4^2 = \mathrm{Mat}^3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3)$, $E_1^2 = P_1^2$,

$E_2^2 = \mathrm{Mat}^3(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2)$, $E_3^2 = E_2^2$,

$E_4^2 = \mathrm{Mat}^3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3)$.


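Scenario 3. Clustering of NDARMA processes. Let $(X_t)_{t \in \mathbb{Z}}$ and
$(\epsilon_t)_{t \in \mathbb{Z}}$ be two count processes with range $\{1, \ldots, r\}$ following the
equation


$$X_t = \alpha_{t,1} X_{t-1} + \ldots + \alpha_{t,p} X_{t-p} + \beta_{t,0}\, \epsilon_t + \ldots + \beta_{t,q}\, \epsilon_{t-q},$$


where $(\epsilon_t)_{t \in \mathbb{Z}}$ is i.i.d. with $P(\epsilon_t = i) = \pi_i$, independent of
$(X_s)_{s<t}$, and the i.i.d. multinomial random vectors


$$(\alpha_{t,1}, \ldots, \alpha_{t,p}, \beta_{t,0}, \ldots, \beta_{t,q}) \sim \mathrm{MULT}(1; \phi_1, \ldots, \phi_p, \varphi_0, \ldots, \varphi_q)$$


are independent of $(\epsilon_t)_{t \in \mathbb{Z}}$ and $(X_s)_{s<t}$. The considered models are
three three-state NDARMA(2,0) processes and one three-state NDARMA(1,0) process with marginal
distribution $\pi^3 = (2/3, 1/6, 1/6)$, and corresponding probabilities in the multinomial
distribution given by


$(\phi_1, \phi_2, \varphi_0)_1^3 = (0.7, 0.2, 0.1)$, $(\phi_1, \phi_2, \varphi_0)_2^3 = (0.1, 0.45, 0.45)$,

$(\phi_1, \phi_2, \varphi_0)_3^3 = (0.5, 0.25, 0.25)$, $(\phi_1, \varphi_0)_4^3 = (0.2, 0.8)$.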
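
The NDARMA(p, 0) recursion can be read as a random selection rule: at each time point the multinomial vector picks exactly one term, so the new value is either a copy of one of the last p observations or a fresh innovation. The sketch below simulates this mechanism. The function name, the integer coding 1, ..., r and the start-up rule (first p values drawn i.i.d. from the marginal distribution) are assumptions for illustration.

```python
import random

def simulate_ndarma_p0(phi, varphi0, pi, T, rng):
    """Simulate an NDARMA(p, 0) series with categories 1..r.

    phi     : (phi_1, ..., phi_p), probabilities of copying X_{t-1}, ..., X_{t-p}
    varphi0 : probability of taking the innovation eps_t
    pi      : marginal distribution of the innovations
    """
    p = len(phi)
    cats = list(range(1, len(pi) + 1))
    # Assumed start-up: the first p values are drawn i.i.d. from pi.
    x = [rng.choices(cats, weights=pi)[0] for _ in range(p)]
    for t in range(p, T):
        # A multinomial vector with one trial selects exactly one component.
        choice = rng.choices(range(p + 1), weights=list(phi) + [varphi0])[0]
        if choice < p:
            x.append(x[t - 1 - choice])                 # copy X_{t-(choice+1)}
        else:
            x.append(rng.choices(cats, weights=pi)[0])  # fresh innovation eps_t
    return x

pi3 = (2 / 3, 1 / 6, 1 / 6)
series = simulate_ndarma_p0((0.7, 0.2), 0.1, pi3, 200, random.Random(0))
# Degenerate check: with phi_1 = 1 the series repeats its first value forever.
frozen = simulate_ndarma_p0((1.0,), 0.0, pi3, 50, random.Random(0))
```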


The simulation study was carried out as follows. For each scenario, 5 CTS of
length $T \in \{200, 600\}$ were generated from each process, and the clustering
algorithms were executed once for each value of $T$ in order to analyze the impact of the
series length. The clustering solution produced by each considered algorithm was stored.
The simulation procedure was repeated 500 times for each scenario and value of
$T$. The computation of $d_{CC}$ and $d_B$ was carried out by considering $\mathcal{L} = \{1\}$ in
Scenarios 1 and 2, and $\mathcal{L} = \{1, 2\}$ in Scenario 3. This way, we adapted the distances
to the maximum number of significant lags existing in each setting.


3.2 Alternative Metrics and Assessment Criteria 


To better analyze the performance of both metrics dcc and dg, we also obtained 
partitions by using alternative techniques for clustering of categorical series. The 
considered procedures are described below. 
* Model-based approach using maximum likelihood estimation (MLE). The distance
between two CTS is defined as the squared Euclidean distance between the
corresponding vectors of coefficients fitted via MLE ($d_{MLE}$).

* Model-based approach using mixtures. [4] propose to group a set of CTS by using
a mixture of first order Markov models via the EM algorithm ($d_{CZ}$).

* A hybrid framework for clustering CTS. [6] presents a dissimilarity between
categorical series which evaluates both closeness between raw categorical values
and proximity between dynamic patterns ($d_{MV}$).


Note that the approach based on the distance $d_{MLE}$ can be seen as a strict
benchmark in the evaluation task. The effectiveness of the clustering approaches
was assessed by comparing the clustering solution produced by the algorithms with
the true clustering partition, so-called ground truth. The latter consisted of $C = 4$
clusters in all scenarios, each group including the five CTS generated from the same
process. The value $C = 4$ was provided as input parameter to the PAM algorithm
in the case of $d_{CC}$, $d_B$, $d_{MLE}$ and $d_{MV}$. As for the approach $d_{CZ}$, four
components were considered for the mixture model. Experimental and true partitions
were compared by using three well-known external clustering quality indexes, the
Adjusted Rand Index (ARI), the Jaccard Index (JI) and the Fowlkes-Mallows index
(FMI).
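
All three indexes can be written in terms of pair counts: the number of pairs of series placed together in both partitions, in only one of them, or in neither. The following sketch uses these standard pair-counting formulas; the function names are illustrative, and degenerate cases in which a denominator vanishes (for instance, two partitions made only of singletons) are not handled.

```python
from itertools import combinations
from math import sqrt

def pair_counts(truth, pred):
    """Count pairs that are together in both (a), truth only (b), pred only (c), neither (d)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_p = pred[i] == pred[j]
        if same_t and same_p:
            a += 1
        elif same_t:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def external_indexes(truth, pred):
    """Return (ARI, JI, FMI) for two hard partitions given as label lists."""
    a, b, c, d = pair_counts(truth, pred)
    n_pairs = a + b + c + d
    expected = (a + b) * (a + c) / n_pairs   # expected agreement under chance
    maximum = ((a + b) + (a + c)) / 2        # largest attainable agreement
    ari = (a - expected) / (maximum - expected)
    ji = a / (a + b + c)
    fmi = a / sqrt((a + b) * (a + c))
    return ari, ji, fmi

# Ground truth of the simulation design: 4 clusters with 5 series each.
truth = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5
perfect = external_indexes(truth, truth)
crossed = external_indexes([0, 0, 1, 1], [0, 1, 0, 1])
```

A perfect recovery gives the value 1 for all three indexes, whereas the ARI can become negative when the agreement is below what chance would produce, which is consistent with the negative entry reported for one method in Table 3.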


3.3 Results and Discussion 


Average values of the quality indexes by taking into account the 500 simulation trials 
are given in Tables 1, 2 and 3 for Scenarios 1, 2 and 3, respectively. 


Table 1 Average results for Scenario 1.


                   T = 200                   T = 600
Method       ARI     JI      FMI       ARI     JI      FMI
$d_{CC}$     0.774   0.710   0.830     0.916   0.886   0.935
$d_B$        0.729   0.661   0.792     0.861   0.878   0.893
$d_{MLE}$    0.704   0.633   0.772     0.841   0.792   0.876
$d_{CZ}$     0.712   0.648   0.786     0.915   0.886   0.934
$d_{MV}$     0.406   0.363   0.665     0.379   0.363   0.650


The results in Table 1 indicate that the dissimilarity $d_{CC}$ is the best performing one
when dealing with MC, outperforming the MLE-based metric $d_{MLE}$. The distance
$d_B$ is also superior to $d_{MLE}$. In Scenario 1, the measure $d_{CZ}$ attains results similar
to those of $d_{CC}$, especially for $T = 600$. The good performance of $d_{CZ}$ was expected,
since the assumption of first order Markov models considered by this metric is
fulfilled in Scenario 1. Table 2 shows a completely different picture, indicating that
the metrics $d_{CC}$ and $d_B$ are clearly more effective than the rest
of the dissimilarities. Finally, the quantities in Table 3 reveal that the model-based
distance $d_{MLE}$ attains the best results when $T = 200$, but is outperformed by $d_B$ when
$T = 600$. The metric $d_{CZ}$ suffers again from model misspecification.


Table 2 Average results for Scenario 2.


                   T = 200                   T = 600
Method       ARI     JI      FMI       ARI     JI      FMI
$d_{CC}$     0.707   0.639   0.777     0.856   0.810   0.888
$d_B$        0.760   0.701   0.812     0.963   0.949   0.971
$d_{MLE}$    0.354   0.342   0.512     0.299   0.310   0.478
$d_{CZ}$     0.645   0.577   0.739     0.703   0.638   0.779
$d_{MV}$     0.089   0.175   0.323     0.062   0.175   0.301


Table 3 Average results for Scenario 3.


                   T = 200                   T = 600
Method       ARI     JI      FMI       ARI     JI      FMI
$d_{CC}$     0.627   0.563   0.715     0.875   0.837   0.903
$d_B$        0.680   0.612   0.754     0.925   0.901   0.941
$d_{MLE}$    0.727   0.656   0.788     0.872   0.828   0.900
$d_{CZ}$     0.586   0.562   0.603     0.647   0.577   0.738
$d_{MV}$     0.035   0.167   0.292    -0.028   0.138   0.251


In summary, the numerical experiments carried out throughout this section show the
excellent ability of both measures $d_{CC}$ and $d_B$ to discriminate between a broad variety of
categorical processes. Specifically, these metrics either outperform or behave similarly
to distances based on estimated model coefficients, which take advantage
of knowing the true underlying models.

It is worth highlighting that the methods proposed in this paper could have
promising applications in fields such as the clustering of genetic data sequences.
Fuzzy Clustering by Hyperbolic Smoothing


David Masís, Esteban Segura, Javier Trejos, and Adilson Xavier 


Abstract We propose a novel method for building fuzzy clusters of large data sets,
using a smoothing numerical approach. The usual sum-of-squares criterion is relaxed
so that the search for good fuzzy partitions is made on a continuous space, rather than a
combinatorial space as in classical methods [8]. The smoothing converts a strongly
non-differentiable problem into differentiable, low-dimensional, unconstrained
optimization subproblems, by using an infinitely differentiable ($C^\infty$) function.
For the implementation of the algorithm, we used the statistical software R, and the
results obtained were compared to the traditional fuzzy C-means method proposed by
Bezdek [1].


Keywords: clustering, fuzzy sets, numerical smoothing 


1 Introduction 


Methods for making groups from data sets are usually based on the idea of disjoint
sets, such as the classical crisp clustering. The best known are hierarchical
clustering and k-means [8], whose resulting clusters are sets with no intersection. However,
this restriction may not be natural for some applications, where some objects
may need to belong to two or more clusters, rather than only one. Several
methods for constructing overlapping clusters have been proposed in the literature
[4, 5, 8]. Since Zadeh introduced the concept of fuzzy sets [17], the principle of
belonging to several clusters has been used in the sense of a degree of membership
to such clusters. In this direction, Bezdek [1] introduced a fuzzy clustering method
that became very popular since it solved the problem of representation of clusters
with centroids and the assignment of objects to clusters, by the minimization of
a well-stated numerical criterion. Several methods for fuzzy clustering have been
proposed in the literature; a survey of these methods can be found in [16].

In this paper we propose a new fuzzy clustering method based on the numerical
principle of hyperbolic smoothing [15]. The Fuzzy C-Means method is presented in
Section 2 and our proposed Hyperbolic Smoothing Fuzzy Clustering method in
Section 3. Comparative results between these two methods are presented in Section
4. Finally, Section 5 is devoted to the concluding remarks.


2 Fuzzy Clustering 


The best known method for fuzzy clustering is Bezdek's original C-means
method [1]. It is based on the same principles as k-means or dynamical clusters
[2], that is, iterations on two main steps: i) class representation by the optimization
of a numerical criterion, and ii) assignment to the closest class representative in
order to construct clusters. These iterations are repeated until convergence
to a local minimum of the overall quality criterion.

Let us introduce the notation that will be used and the numerical criterion for
optimization. Let $X$ be an $n \times p$ data matrix containing $p$ numerical observations
over $n$ objects. We look for a $K \times p$ matrix $G$ that represents centroids of $K$ clusters
of the $n$ objects and an $n \times K$ membership matrix $U$ with elements $\mu_{ik} \in [0, 1]$, such
that the following criterion is minimized:


$$W(X, U, G) = \sum_{i=1}^{n} \sum_{k=1}^{K} (\mu_{ik})^m \, \|x_i - g_k\|^2 \qquad (1)$$

subject to $\sum_{k=1}^{K} \mu_{ik} = 1$, for all $i \in \{1, 2, \ldots, n\}$,
and $0 < \sum_{i=1}^{n} \mu_{ik} < n$, for all $k \in \{1, 2, \ldots, K\}$,


where $x_i$ is the $i$-th row of $X$ and $g_k$ is the $k$-th row of $G$, representing in
$\mathbb{R}^p$ the centroid of the $k$-th cluster.

The parameter $m \neq 1$ in (1) controls the fuzzyness of the clusters. According to
the literature [16], it is usual to take $m = 2$, since greater values of $m$ tend to give
very low values of $\mu_{ik}$ (memberships close to $1/K$ for every cluster), whereas values
of $m$ close to 1 tend to the usual crisp partitions such as in k-means. We
also assume that the number of clusters, $K$, is fixed.

Minimization of (1) represents a non linear optimization problem with constraints,
which can be solved using Lagrange multipliers as presented in [1]. The solution,
for each row of the centroids matrix, given a matrix $U$, is:

$$g_k = \sum_{i=1}^{n} (\mu_{ik})^m x_i \Big/ \sum_{i=1}^{n} (\mu_{ik})^m. \qquad (2)$$


The solution for the membership matrix, given a centroids matrix $G$, is [1]:


$$\mu_{ik} = \left[ \sum_{j=1}^{K} \left( \frac{\|x_i - g_k\|^2}{\|x_i - g_j\|^2} \right)^{1/(m-1)} \right]^{-1}. \qquad (3)$$

The following pseudo-code shows the main steps of Bezdek's Fuzzy C-Means
method [1].


Bezdek's Fuzzy C-Means (FCM) Algorithm


1. Initialize fuzzy membership matrix $U = [\mu_{ik}]_{n \times K}$
2. Compute centroids for fuzzy clusters according to (2)
3. Update membership matrix $U$ according to (3)
4. If improvement in the criterion is less than a threshold, then stop; otherwise go
to Step 2.


The Fuzzy C-Means method starts from an initial partition that is improved in each
iteration, according to (1), applying Steps 2 and 3 of the algorithm. It is clear that
this procedure may lead to local optima of (1) since iterative improvement in (2) and
(3) is made by a local search strategy.
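
The four steps of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration and not the fclust implementation used later in the paper. The random initialisation, the stopping tolerance `tol`, the iteration cap and the small floor applied to squared distances (to avoid division by zero when an object coincides with a centroid) are all assumptions of the sketch.

```python
import random

def fcm(X, K, m=2.0, tol=1e-9, max_iter=500, seed=0):
    """Fuzzy C-Means on a list of points X; returns (U, G, W)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Step 1: random memberships, each row normalised to sum to one.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(K)]
        s = sum(row)
        U.append([v / s for v in row])
    prev = float("inf")
    for _ in range(max_iter):
        # Step 2: centroids as membership-weighted means, equation (2).
        G = []
        for k in range(K):
            w = [U[i][k] ** m for i in range(n)]
            G.append([sum(w[i] * X[i][j] for i in range(n)) / sum(w)
                      for j in range(p)])
        # Squared Euclidean distances, floored to keep the update well defined.
        D = [[max(sum((X[i][j] - G[k][j]) ** 2 for j in range(p)), 1e-12)
              for k in range(K)] for i in range(n)]
        # Step 3: memberships, equation (3), written with inverse distances.
        for i in range(n):
            inv = [D[i][k] ** (-1.0 / (m - 1.0)) for k in range(K)]
            s = sum(inv)
            U[i] = [v / s for v in inv]
        # Step 4: stop when criterion (1) no longer improves by at least tol.
        W = sum(U[i][k] ** m * D[i][k] for i in range(n) for k in range(K))
        if prev - W < tol:
            break
        prev = W
    return U, G, W

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
U, G, W = fcm(X, K=2)
labels = [row.index(max(row)) for row in U]  # crisp labels by largest membership
```

Since each run depends on the initial memberships and may stop at a local optimum, restarting from several seeds and keeping the run with the smallest criterion value is a common practice.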


3 Algorithm for Hyperbolic Smoothing Fuzzy Clustering 


For the clustering problem of the $n$ rows of data matrix $X$ in $K$ clusters, we can seek
the minimum distance between every $x_i$ and its class center $g_k$:


$$z_i^2 = \min_{g_k \in G} \|x_i - g_k\|_2^2,$$


where $\|\cdot\|_2$ is the Euclidean norm. The minimization can be stated as a
sum-of-squares:


$$\min \sum_{i=1}^{n} \min_{g_k \in G} \|x_i - g_k\|_2^2 = \min \sum_{i=1}^{n} z_i^2,$$


leading to the following constrained problem:


$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to } z_i = \min_{g_k \in G} \|x_i - g_k\|_2, \text{ with } i = 1, \ldots, n.$$
This is equivalent to the following minimization problem:


$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to } z_i - \|x_i - g_k\|_2 \le 0, \text{ with } i = 1, \ldots, n \text{ and } k = 1, \ldots, K.$$


Considering the function $\varphi(y) = \max(0, y)$, we obtain the problem:


$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to } \sum_{k=1}^{K} \varphi(z_i - \|x_i - g_k\|_2) = 0, \text{ for } i = 1, \ldots, n.$$


That problem can be re-stated as the following one:

$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to } \sum_{k=1}^{K} \varphi(z_i - \|x_i - g_k\|_2) > 0, \text{ for } i = 1, \ldots, n.$$


Given a perturbation $\varepsilon > 0$ it leads to the problem:


$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to } \sum_{k=1}^{K} \varphi(z_i - \|x_i - g_k\|_2) \ge \varepsilon, \text{ for } i = 1, \ldots, n.$$

It should be noted that function $\varphi$ is not differentiable. Therefore, we apply
a smoothing procedure in order to formulate a differentiable function and proceed
with a minimization by a numerical method. For that, consider the function

$$\psi(y, \tau) = \frac{y + \sqrt{y^2 + \tau^2}}{2}, \quad \text{for all } y \in \mathbb{R},\ \tau > 0,$$

and the function

$$\theta(x_i, g_k, \gamma) = \sqrt{\sum_{j=1}^{p} (x_{ij} - g_{kj})^2 + \gamma^2}, \quad \text{for } \gamma > 0.$$

Hence, the minimization problem is transformed into:

$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to } \sum_{k=1}^{K} \psi(z_i - \theta(x_i, g_k, \gamma), \tau) \ge \varepsilon, \text{ for } i = 1, \ldots, n.$$

Finally, according to the Karush-Kuhn-Tucker conditions [10, 11], all the
constraints are active and the final formulation of the problem is:


$$\min \sum_{i=1}^{n} z_i^2$$

$$\text{subject to } h_i(z_i, G) = \sum_{k=1}^{K} \psi(z_i - \theta(x_i, g_k, \gamma), \tau) - \varepsilon = 0, \text{ for } i = 1, \ldots, n, \quad \varepsilon, \tau, \gamma > 0. \qquad (4)$$
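
To make the role of the smoothing parameters concrete, the sketch below evaluates the two smoothing functions in plain Python, assuming the hyperbolic forms $(y + \sqrt{y^2 + \tau^2})/2$ for the smoothed maximum and $\sqrt{\sum_j (x_{ij} - g_{kj})^2 + \gamma^2}$ for the smoothed distance. The function names are chosen for illustration only. The loop reports how far the smoothed function is from max(0, y) on a few sample points as the parameter shrinks.

```python
from math import sqrt

def phi(y):
    """Non-differentiable target: max(0, y)."""
    return max(0.0, y)

def psi(y, tau):
    """Hyperbolic smoothing of max(0, y); differentiable everywhere for tau > 0."""
    return (y + sqrt(y * y + tau * tau)) / 2.0

def theta(x, g, gamma):
    """Smoothed Euclidean distance between a point x and a centroid g."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, g)) + gamma * gamma)

for tau in (1.0, 0.1, 0.001):
    gap = max(psi(y, tau) - phi(y) for y in (-2.0, -0.5, 0.0, 0.5, 2.0))
    print(f"tau={tau}: largest gap to max(0, y) on the sample = {gap:.6f}")
```

At y = 0 the smoothed value is exactly half the smoothing parameter, which is where the approximation is loosest, so driving the parameter to zero recovers the original function.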


Based on (4), the Hyperbolic Smoothing Clustering Method was stated in [15]; it is
presented in the following algorithm.

Hyperbolic Smoothing Clustering Method (HSCM) Algorithm


1. Initialize cluster membership matrix $U = [u_{ik}]_{n \times K}$
2. Choose initial values: $G^0$, $\gamma^1$, $\tau^1$, $\varepsilon^1$
3. Choose values: $0 < \rho_1 < 1$, $0 < \rho_2 < 1$, $0 < \rho_3 < 1$
4. Let $l = 1$
5. Repeat steps 6 and 7 until a stop condition is reached:
6. Solve problem (P): $\min f(G) = \sum_{i=1}^{n} z_i^2$ with $\gamma = \gamma^l$, $\tau = \tau^l$ and
$\varepsilon = \varepsilon^l$, $G^{l-1}$ being the initial value and $G^l$ the obtained solution
7. Let $\gamma^{l+1} = \rho_1 \gamma^l$, $\tau^{l+1} = \rho_2 \tau^l$, $\varepsilon^{l+1} = \rho_3 \varepsilon^l$ and $l = l + 1$.

The most relevant task in the hyperbolic smoothing clustering method is finding
the zeroes of the functions $h_i(z_i, G) = \sum_{k=1}^{K} \psi(z_i - \theta(x_i, g_k, \gamma), \tau) - \varepsilon = 0$
for $i = 1, \ldots, n$. In this paper, we used the Newton-Raphson method for finding
these zeroes [3], particularly the BFGS procedure [12]. Convergence of the
Newton-Raphson method was successful, mainly thanks to a good choice of initial solutions.
In our implementation, these initial approximations were generated by calculating the
minimum distance between the $i$-th object and the $k$-th centroid for a given partition.
Once the zeroes $z_i$ of the functions $h_i$ are obtained, the hyperbolic
smoothing is implemented. The final solution for this method consists of solving a finite
number of optimization subproblems corresponding to problem (P) in Step 6 of the HSCM
algorithm. Each one of these subproblems was solved with the R routine optim [13],
a useful tool for solving optimization problems in non linear programming. As far
as we know there is no closed-form solution for this step. In the future, we may
consider writing our own program, but for this paper we use this R
routine.

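Since we have that $\sum_{k=1}^{K} \psi(z_i - \theta(x_i, g_k, \gamma), \tau) = \varepsilon$, each entry
$u_{ik}$ of the membership matrix is given by
$u_{ik} = \psi(z_i - \theta(x_i, g_k, \gamma), \tau) / \varepsilon$. It is worth noting that fuzzyness
is controlled by parameter $\varepsilon$.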
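
The computation of the zeroes and of the resulting memberships can be sketched for fixed centroids as follows. Note that this sketch swaps in plain bisection for the Newton-Raphson procedure used by the authors: each function $h_i$ is strictly increasing in $z_i$, so a sign change can be bracketed and halved. The bracketing bounds, the iteration count and all function names are assumptions of the example.

```python
from math import sqrt

def psi(y, tau):
    """Hyperbolic smoothing of max(0, y)."""
    return (y + sqrt(y * y + tau * tau)) / 2.0

def theta(x, g, gamma):
    """Smoothed Euclidean distance between point x and centroid g."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, g)) + gamma * gamma)

def solve_z(thetas, tau, eps, iters=200):
    """Find z with sum_k psi(z - theta_k, tau) = eps by bisection (not Newton-Raphson)."""
    K = len(thetas)
    h = lambda z: sum(psi(z - t, tau) for t in thetas) - eps
    # For y < 0, psi(y, tau) < tau^2 / (4|y|), so h is negative at this lower bound.
    lo = min(thetas) - (K * tau * tau / (4.0 * eps) + 1.0)
    # psi(eps, tau) > eps, so h is positive at this upper bound.
    hi = min(thetas) + eps
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def hs_memberships(X, G, gamma, tau, eps):
    """Return (U, Z): memberships u_ik = psi(z_i - theta_ik, tau) / eps and the zeroes z_i."""
    U, Z = [], []
    for x in X:
        thetas = [theta(x, g, gamma) for g in G]
        z = solve_z(thetas, tau, eps)
        Z.append(z)
        U.append([psi(z - t, tau) / eps for t in thetas])
    return U, Z

X = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
G = [[0.0, 0.0], [10.0, 10.0]]
U, Z = hs_memberships(X, G, gamma=0.001, tau=0.001, eps=0.01)
```

Because each row of memberships is the left-hand side of the active constraint divided by the perturbation, the rows sum to one by construction; an object equidistant from two centroids receives equal memberships.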

The following algorithm contains the main steps of the Hyperbolic Smoothing
Fuzzy Clustering (HSFC) method.


Hyperbolic Smoothing Fuzzy Clustering (HSFC) Algorithm

1. Set $\varepsilon > 0$
2. Choose initial values for: $G^{(0)}$ (centroids matrix), $\gamma^{(1)}$, $\tau^{(1)}$ and $N$
(maximum number of iterations)
3. Choose values: $0 < \rho_1 < 1$, $0 < \rho_2 < 1$
4. Set $l = 1$
5. While $l < N$:
6. Solve the problem (P): Minimize $f(G) = \sum_{i=1}^{n} z_i^2$ with $\gamma = \gamma^{(l)}$ and
$\tau = \tau^{(l)}$, with an initial point $G^{(l-1)}$ and $G^{(l)}$ being the obtained solution
7. Set $\gamma^{(l+1)} = \rho_1 \gamma^{(l)}$, $\tau^{(l+1)} = \rho_2 \tau^{(l)}$, and $l = l + 1$
8. Set $u_{ik} = \psi(z_i - \theta(x_i, g_k, \gamma), \tau) / \varepsilon$ for $i = 1, \ldots, n$ and
$k = 1, \ldots, K$.
4 Comparative Results


Performance of the HSFC method was studied on a data table well known from the
literature, Fisher's iris [7], and on 16 simulated data tables built from a semi-Monte
Carlo procedure [14].

For comparing FCM and HSFC, we used the implementation of FCM in the R
package fclust [6]. This comparison was based on the within-class sum-of-squares
$W(P) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mu_{ik} \|x_i - g_k\|^2$. Both methods were applied 50 times
and the best value of $W$ is reported. For simplicity, for HSFC we used the following
parameters: $\rho_1 = \rho_2 = \rho_3 = 0.25$, $\varepsilon = 0.01$ and $\gamma = \tau = 0.001$ as
initial values. Table 1 shows the results for Fisher's iris, in which case HSFC performs
slightly better. It also contains the Adjusted Rand Index (ARI) [9] between HSFC and the best
FCM result among 100 runs; ARI compares fuzzy membership matrices crisped into
hard partitions.


Table 1 Minimum sum-of-squares (SS) reported for the Fisher's iris data table with HSFC and
FCM, K being the number of clusters, ARI comparing both methods. In bold best method.


Table            K    SS for HSFC    SS for FCM    ARI
Fisher's iris    2    152.348        152.3615      1
                 3    78.85567       78.86733      0.994
                 4    57.26934       57.26934      0.980


Simulated data tables were generated in a controlled experiment as in [14], with
random numbers following a Gaussian distribution. Factors of the experiment were:

* The number of objects (with 2 levels, n = 105 and n = 525).

* The number of clusters (with levels K = 3 and K = 7).

* Cardinality (card) of clusters, with levels i) all with the same number of objects
(coded as card(=)), and ii) one large cluster with 50% of objects and the rest with
the same number (coded as card(≠)).

* Standard deviation of clusters, with levels i) all Gaussian random variables with
standard deviation (SD) equal to one (coded as SD(=)), and ii) one cluster with
SD=3 and the rest with SD=1 (coded as SD(≠)).


Table 2 lists the simulated data tables together with the codes of their characteristics.

Table 3 contains the minimum values of the sum-of-squares obtained for our
HSFC and Bezdek's FCM methods; the best solution of 100 random applications
of FCM is presented, together with one run of HSFC. It also contains the ARI values
comparing the HSFC solution with that best solution of FCM. It can be seen that,
generally, the HSFC method tends to obtain better results than FCM, with only a few
exceptions. In 23 cases HSFC obtains better results, FCM is better in 5 cases, and
results are the same in 17 cases. However, ARI shows that partitions tend to be very
similar with both methods.
Table 2 Codes and characteristics of simulated data tables; n: number of objects, K: number of
clusters, card: cardinality, SD: standard deviation.


Table   Characteristics                        Table   Characteristics

T1      n = 525, K = 3, card(=), SD(=)         T9      n = 525, K = 3, card(≠), SD(=)
T2      n = 525, K = 7, card(=), SD(=)         T10     n = 525, K = 7, card(≠), SD(=)
T3      n = 105, K = 3, card(=), SD(=)         T11     n = 105, K = 3, card(≠), SD(=)
T4      n = 105, K = 7, card(=), SD(=)         T12     n = 105, K = 7, card(≠), SD(=)
T5      n = 525, K = 3, card(=), SD(≠)         T13     n = 525, K = 3, card(≠), SD(≠)
T6      n = 525, K = 7, card(=), SD(≠)         T14     n = 525, K = 7, card(≠), SD(≠)
T7      n = 105, K = 3, card(=), SD(≠)         T15     n = 105, K = 3, card(≠), SD(≠)
T8      n = 105, K = 7, card(=), SD(≠)         T16     n = 105, K = 7, card(≠), SD(≠)


Table 3 Minimum sum-of-squares (SS) reported for HSFC and FCM methods on the simulated
data tables. Best method in bold.


Table  K  SS for HSFC  SS for FCM  ARI      Table  K  SS for HSFC  SS for FCM  ARI

T1     2  7073.402     7073.814    0.780    T9     2  12524.31     12524.31    0.900
       3  3146.119     3146.119    1               3  9269.361     9269.611    1
       4  2983.651     2983.651    1               4  6298.47      6298.368    1

T2     2  16987.19     16987.71    0.764    T10    2  5466.893     5466.912    0.890
       3  11653.22     11653.22    1               3  2977.58      2977.58     1
       4  7776.855     7777.396    1               4  2745.721     2746.671    1

T3     2  3923.051     3923.062    0.763    T11    2  2969.247     2969.32     0.860
       3  2917.13      2917.13     0.754           3  1912.323     1912.323    1
       4  2287.523     2256.298    0.993           4  1401.394     1401.394    1

T4     2  1720.365     1720.374    0.992    T12    2  1816.056     1816.056    1
       3  569.3112     569.3112    1               3  525.7118     525.7118    1
       4  535.5491     535.3541    1               4  477.0593     477.2696    1

T5     2  15595.67     15595.67    0.910    T13    2  12804.03     12805.05    0.920
       3  11724.93     11725.28    1               3  8816.805     8817.702    1
       4  8409.738     8409.738    0.984           4  6293.774     6293.951    1

T6     2  11877.96     11877.96    0.970    T14    2  16228.07     16228.98    0.920
       3  8299.779     8300.718    1               3  7255.113     7255.423    1
       4  7212.611     7213.725    1               4  6427.313     6427.313    1

T7     2  4336.261     4336.507    0.955    T15    2  2616.286     2616.943    1
       3  3041.076     3041.076    1               3  1978.017     1978.233    1
       4  2395.683     2421.333    1               4  1526.895     1526.953    1

T8     2  1767.43      1767.43     1        T16    2  2226.923     2226.212    0.962
       3  1380.766     1381.019    1               3  1232.074     1232.124    1
       4  1215.302     1211.235    1               4  982.7074     982.9721    1
5 Concluding Remarks


In hyperbolic smoothing, parameters $\tau$, $\gamma$ and $\varepsilon$ tend to zero, so that, through
the constraints in the subproblems, problem (P) tends to solve (1). Parameter $\varepsilon$ controls
the degree of fuzzyness in clustering: the higher it is, the fuzzier the solution; the lower
it is, the crisper the clustering. In order to compare results
and efficiency of the HSFC method, zeroes of functions $h_i$ can be obtained with any
method for solving equations in one variable or a predefined routine. According to
the results we obtained so far and the implementation of the hyperbolic smoothing
for fuzzy clustering, we can conclude that, generally, the HSFC method has a slightly
better performance than Bezdek's original FCM on small real and simulated data
tables. Further research is required for testing the performance of the HSFC method on very
large data sets, with measures of efficiency, quality of solutions and running time.
We also plan to study further comparisons between HSFC and FCM with
different indices, and to write our own program for solving Step 6 of the HSFC algorithm,
that is, the minimization of $f(G)$, instead of using the optim routine in R.
Stochastic Collapsed Variational Inference for
Structured Gaussian Process Regression 
Networks 


Rui Meng, Herbert K. H. Lee, and Kristofer Bouchard 


Abstract This paper presents an efficient variational inference framework for a 
family of structured Gaussian process regression network (SGPRN) models. We 
incorporate auxiliary inducing variables in latent functions and jointly treat both 
the distributions of the inducing variables and hyper-parameters as variational pa- 
rameters. Then we take advantage of the collapsed representation of the model and 
propose structured variational distributions, which enables the decomposability of a 
tractable variational lower bound and leads to stochastic optimization. Our inference 
approach is able to model data in which outputs do not share a common input set, and 
with a computational complexity independent of the size of the inputs and outputs 
to easily handle datasets with missing values. Finally, we illustrate our approach on 
both synthetic and real data. 


Keywords: stochastic optimization, Gaussian process, variational inference, multi- 
variate time series, time-varying correlation 


1 Introduction 


Multi-output regression problems arise in various fields. Often, the processes that 
generate such datasets are nonstationary. Modern instrumentation has resulted in 
increasing numbers of observations, as well as the occurrence of missing values. 
This motivates the development of scalable methods for forecasting in such datasets. 

Multi-output Gaussian process models or multivariate Gaussian process models
(MGP) generalise the powerful Gaussian process predictive model to vector-valued
random fields [1]. Those models demonstrate improved prediction performance
compared with independent univariate Gaussian processes (GP) because MGPs express
correlations between outputs. Since the correlation information of data is encoded in
the covariance function, modeling flexible and computationally efficient
cross-covariance functions is of interest. In the literature of multivariate processes, many
approaches are proposed to build valid cross-covariance functions, including the
linear model of coregionalization (LMC) [2], kernel convolution techniques [3], and
B-spline based coherence functions [4]. However, most of these models are designed
for modelling low-dimensional stationary processes, and require Monte Carlo
simulations, making inference in large datasets computationally intractable.

Modelling the complicated temporal dependencies across variables is addressed in
[5, 6] by several adaptations of stochastic LMC. Such models can handle input-varying
correlation across multivariate outputs. Especially for multivariate time series, [6]
propose a SGPRN that captures time-varying scale, correlation, and smoothness.
However, the inference in [6] is difficult to handle in applications where either the
number of observations or the dimension size is large, or where missing data exist.

Here, we propose an efficient variational inference approach for the SGPRN by 
employing the inducing variable framework on all latent processes [7], taking ad- 
vantage of its collapsed representation where nuisance parameters are marginalized 
out [8] and proposing a tractable variational bound amenable to doubly stochastic 
variational inference. We call our approach variational SGPRN (VSGPRN). This 
variational framework allows the model to handle missing data without increasing 
the computational complexity of inference. We numerically provide evidence of the 
benefits of simultaneously modeling time-varying correlation, scale and smoothness 
in both a synthetic experiment and a real-world problem. 

The main contributions of this work are threefold: 


* Learning structured Gaussian process regression networks using inducing vari- 
ables on both mixing coefficients and latent functions. 

* Employing doubly stochastic variational inference for structured Gaussian
process regression networks by taking advantage of its collapsed representation and
constructing a tractable lower bound of the log-likelihood, making it suitable for
mini-batch learning.

* Demonstrating that our proposed algorithm succeeds in handling time-varying 
correlation on missing data under different scenarios in both synthetic data and 
real data. 


2 Model 


Assume $y(x) \in \mathbb{R}^D$ is a vector-valued function of $x \in \mathbb{R}^P$, where $D$ is the
dimension size of the outputs and $P$ is the dimension size of the inputs. SGPRN
assumes that noisy observations $y(x)$ are the linear combination of latent variables
$g(x) \in \mathbb{R}^D$, corrupted by Gaussian noise $\epsilon(x)$. The coefficients
$L(x) \in \mathbb{R}^{D \times D}$ of the latent functions are assumed to be a stochastic lower
triangular matrix with positive values on the diagonal for model identification [9, 6]. Thus,
SGPRN is defined by the generative model of Figure 1, that is,

$$y(x) = f(x) + \epsilon(x), \qquad f(x) = L(x)\, g(x),$$

with independent white noise $\epsilon(x) \overset{iid}{\sim} \mathcal{N}(0, \sigma^2_{err} I)$. We note that
each latent function $g_d$ in $g$ is independently sampled from a GP with a
non-stationary kernel $K^g$ and the stochastic coefficients are modeled via a
structured GP based prior as proposed in [9] with a stationary kernel $K^l$ such that

$$g_d \overset{iid}{\sim} GP(0, K^g),\ d = 1, \ldots, D, \quad \text{and} \quad l_{ij} \overset{iid}{\sim} \begin{cases} GP(0, K^l), & i > j, \\ \mathrm{logGP}(0, K^l), & i = j, \end{cases}$$

where logGP denotes the log Gaussian process [10]. $K^g$ is modelled as a Gibbs correlation
function

$$K^g(x, x') = \sqrt{\frac{2\, \ell(x)\, \ell(x')}{\ell(x)^2 + \ell(x')^2}}\ \exp\left( -\frac{\|x - x'\|^2}{\ell(x)^2 + \ell(x')^2} \right), \qquad \ell \sim \mathrm{logGP}(0, K^\ell),$$

where $\ell$ determines the input-dependent length scale of the shared correlations in $K^g$ for all
latent functions $g_d$. The varying length-scale process $\ell$ plays an important role in
modelling nonstationary time series as illustrated in [11, 6].

[Figure: (a) Generative Model, (b) Variational structure]

Fig. 1 Graphical model of VSGPRN. Left: Illustration of the generative model. Right: Illustration
of the variational structure. The dashed (red) block means that we marginalize out those latent
variables in the variational inference framework.

Let $X = \{x_n\}_{n=1}^{N}$ be the set of observed inputs and $Y = \{y_n\}_{n=1}^{N}$ be the set
of observed outputs. Denote $\eta$ as the concatenation of all coefficients and all log
length-scale parameters, i.e., $\eta = (l, \tilde{\ell})$ evaluated at training inputs $X$. Here, $l$ is a
vector including the entries below the main diagonal and the entries on the diagonal
in the log scale and $\tilde{\ell} = \log \ell$ is the length-scale parameters in log scale. Also,
denote $\theta = (\theta_l, \theta_\ell, \sigma^2_{err})$ as all hyper-parameters, where $\theta_l$ and
$\theta_\ell$ are the hyper-parameters in kernel $K^l$ and $K^\ell$. We note that directly inferring the
posterior of the latent variables $p(\eta \mid Y, \theta) \propto p(Y \mid \eta, \sigma^2_{err})\, p(\eta \mid \theta_l, \theta_\ell)$
is computationally intractable in general because the computational complexity of
$p(\eta \mid Y, \theta)$ is $O(N^3 D^3)$. To overcome this issue, we propose an efficient variational
inference to significantly reduce the computational burden in the next section.
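
As a small illustration of the non-stationary kernel, the sketch below builds a Gibbs correlation matrix for one-dimensional inputs, assuming the standard Gibbs form with an input-dependent length scale. The function name, the restriction to scalar inputs and the hand-picked length-scale values are assumptions made for the example; in the model the length scale is itself a random process.

```python
from math import exp, sqrt

def gibbs_kernel(xs, ells):
    """Gibbs correlation matrix for 1-D inputs xs, where ells[i] is the length scale at xs[i]."""
    n = len(xs)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = ells[i] ** 2 + ells[j] ** 2
            # The prefactor keeps the diagonal equal to one; the exponent shrinks
            # the correlation wherever the local length scales are short.
            K[i][j] = sqrt(2.0 * ells[i] * ells[j] / s) * exp(-(xs[i] - xs[j]) ** 2 / s)
    return K

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
K_const = gibbs_kernel(xs, [1.0] * 5)                   # constant length scale
K_vary = gibbs_kernel(xs, [1.0, 0.8, 0.5, 0.3, 0.2])    # length scale shrinking with x
```

With a constant length scale the kernel reduces to the stationary squared exponential kernel, while a shrinking length scale makes neighbouring points at the right end of the grid less correlated than equally spaced points at the left end.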
3 Inference


We introduce a shared set of inducing inputs $Z = \{z_m\}_{m=1}^{M}$ that lie in the same space
as the inputs $X$ and a set of shared inducing variables $w_d$ for each latent function
$g_d$ evaluated at the inducing inputs $Z$. Likewise, we consider inducing variables $u_{ii}$
for the function $\log L_{ii}$ when $i = j$, $u_{ij}$ for function $L_{ij}$ when $i > j$, and inducing
variables $v$ for function $\log \ell(x)$ evaluated at inducing inputs $Z$. We denote those
collective variables as $l = \{l_{ij}\}_{i \ge j}$, $u = \{u_{ij}\}_{i \ge j}$, $g = \{g_d\}_{d=1}^{D}$,
$w = \{w_d\}_{d=1}^{D}$, $\ell$ and $v$. Then we redefine the model parameters
$\eta = (l, u, g, w, \ell, v)$, and the prior of those model parameters is
$p(\eta) = p(l \mid u)\, p(u)\, p(g \mid w, \ell, v)\, p(w)\, p(\ell \mid v)\, p(v)$.

The core assumption of inducing point-based sparse inference is that the inducing 
variables are sufficient statistics for the training and testing data in the sense that the 
training and testing data are conditionally independent given the inducing variables. 
In the context of our model, this means that the posterior processes of $L$, $g$ and $\ell$ are
sufficiently determined by the posterior distribution of u, w and v. We propose a
structured variational distribution and its corresponding variational lower bound. Due 
to the nonconjugacy of this model, instead of doing expectation in the evidence lower 
bound (ELBO), as is normally done in the literature, we perform the marginalization 
on inducing variables u, w and g, and then use the reparameterization trick to 
apply end-to-end training with stochastic gradient descent. We will also discuss a 
procedure for missing data inference and prediction. 

To capture the posterior dependency between the latent functions, we propose a
structured variational distribution of the model parameters $\eta$ used to approximate its
posterior distribution as $q(\eta) = p(l|u)\,p(g|w, \ell, v)\,p(\ell|v)\,q(u, w, v)$. This varia-
tional structure is illustrated in Figure 1. The variational distribution of the inducing
variables $q(u, w, v)$ fully characterizes the distribution of $q(\eta)$. Thus, the inference
of $q(u, w, v)$ is of interest. We assume the parameters u, w, and v are Gaussian
and mutually independent.

Given the definition of Gaussian process priors for the SGPRN, the conditional
distributions $p(l|u)$, $p(g|w, \ell, v)$, and $p(\ell|v)$ have closed-form expressions and all
are Gaussian, except for $p(\ell|v)$, which is log Gaussian. The ELBO of the log like-
lihood of observations under our structured variational distribution $q(\eta)$ is derived
using Jensen's inequality as:


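\[ \log p(Y) \ge \mathbb{E}_{q(\eta)}\left[ \log \frac{p(Y|g, l)\, p(u)\, p(w)\, p(v)}{q(u, w, v)} \right] = R - A, \tag{1} \]

where $R = \sum_{n=1}^{N} \sum_{d=1}^{D} \mathbb{E}_{q(g_n, l_n)} \log p(y_{nd}|g_n, l_n)$ is the reconstruction term and
$A = \mathrm{KL}(q(u)\|p(u)) + \mathrm{KL}(q(w)\|p(w)) + \mathrm{KL}(q(v)\|p(v))$ is the regularization
term. Here $g_n = \{g_{dn} = (g_d)_n\}_{d=1}^{D}$ and $l_n = \{l_{ijn} = (l_{ij})_n\}_{i \ge j}$ are latent variables.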

The structured decomposition trick for $q(\eta)$ has also been used by [12] to derive
variational inference for the multivariate output case. The benefit of this structure
is that all conditional distributions in $q(\eta)$ can be cancelled in the derivation of the
lower bound in (1), which alleviates the computational burden of inference. Because
of the conditional independence of the reconstruction term in (1) given $g$ and $l$, the
lower bound decomposes across both inputs and outputs and this enables the use
of stochastic optimization methods. Moreover, due to the Gaussian assumption in
the prior and variational distributions of the inducing variables, all KL divergence
terms in the regularization term A are analytically tractable. Next, instead of directly
computing the expectation, we leverage stochastic inference [13].

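Stochastic inference requires sampling of $l$ and $g$ from the joint variational
posterior $q(\eta)$. Directly sampling them would introduce much uncertainty from
intermediate variables and thus make inference inefficient. To tackle this issue,
we marginalize unnecessary intermediate variables u and w and obtain the
marginal distributions $q(l) = \prod_{i=j} \log\mathcal{N}(l_{ii}|\tilde{\mu}^l_{ii}, \tilde{\Sigma}^l_{ii}) \prod_{i>j} \mathcal{N}(l_{ij}|\tilde{\mu}^l_{ij}, \tilde{\Sigma}^l_{ij})$ and
$q(g|\ell, v) = \prod_{d=1}^{D} \mathcal{N}(g_d|\tilde{\mu}^g_d, \tilde{\Sigma}^g_d)$ with a joint distribution $q(\ell, v) = p(\ell|v)\,q(v)$,
where the conditional mean and covariance matrix are easily derived. The corre-
sponding marginal distributions $q(l_n)$ and $q(g_n|\ell, v)$ at each n are also easy to
derive. Moreover, we conduct collapsed inference by marginalizing the latent vari-
ables $g_n$, so then the individual expectation is

\[ \mathbb{E}_{q(g_n, l_n)} \log p(y_{nd}|g_n, l_n) = \int L_{nd}\; q(l_n)\, q(\ell, v)\, d(l_n, \ell, v), \tag{2} \]

where $L_{nd} = \log \mathcal{N}\big(y_{nd} \,\big|\, \sum_{j=1}^{D} l_{djn}\, \tilde{\mu}^g_{jn},\ \sigma^2_{err}\big) - \frac{1}{2\sigma^2_{err}} \sum_{j=1}^{D} l_{djn}^2\, \tilde{\sigma}^{g\,2}_{jn}$ measures the
reconstruction performance for observations $y_{nd}$, with $\tilde{\mu}^g_{jn}$ and $\tilde{\sigma}^{g\,2}_{jn}$ the marginal
mean and variance of $g_{jn}$ under $q(g_n|\ell, v)$.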
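The collapsed term can be checked numerically. The sketch below assumes a single output, a fixed coefficient row `l`, and independent Gaussian marginals for the latent functions (all numerical values are illustrative, not taken from the paper). It compares the closed form, a Gaussian log density minus a variance penalty, with a plain Monte Carlo average over samples of g.

```python
import numpy as np

def collapsed_term(y, l, mu, var, sigma2):
    """Closed form of E_g[log N(y | l'g, sigma2)] for g ~ N(mu, diag(var))."""
    mean = l @ mu
    log_gauss = -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mean) ** 2 / sigma2
    return log_gauss - 0.5 * np.sum(l**2 * var) / sigma2

def monte_carlo_term(y, l, mu, var, sigma2, n_samples, rng):
    """Monte Carlo estimate of the same expectation, for comparison only."""
    g = mu + np.sqrt(var) * rng.standard_normal((n_samples, mu.size))
    mean = g @ l
    log_lik = -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mean) ** 2 / sigma2
    return log_lik.mean()

rng = np.random.default_rng(0)
l = np.array([0.8, -0.3, 1.2])     # one row of L(x_n), illustrative values
mu = np.array([0.5, 1.0, -0.2])    # marginal means of g_n (assumed)
var = np.array([0.1, 0.2, 0.05])   # marginal variances of g_n (assumed)
exact = collapsed_term(1.0, l, mu, var, 0.5)
approx = monte_carlo_term(1.0, l, mu, var, 0.5, 200_000, rng)
```

Integrating g out analytically removes one source of sampling noise, so only the coefficients and the length-scale process still need to be sampled.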

Directly evaluating the ELBO is still challenging due to the non-linearities in- 
troduced by our structured prior. Recent progress in black box variational inference 
[13] avoids this difficulty by computing noisy unbiased estimates of the gradient of 
ELBO, via approximating the expectations with unbiased Monte Carlo estimates and 
relying on either score function estimators [14] or reparameterization gradients [13] 
to differentiate through a sampling process. Here we leverage the reparameterization 
gradients for stochastic optimization of the model parameters. We note that evaluating
the ELBO (1) involves two sources of stochasticity: Monte Carlo sampling in (2)
and data sub-sampling [15]. The prediction procedure is based on
Bayes' rule and replaces the posterior distribution by the inferred variational distribu- 
tion. In the case of missing data, the only modification in (1) is in the reconstruction 
term, where we sum up the likelihoods of observed data instead of complete data. 


4 Experiments 


This section illustrates the performance of our model on multivariate time series. We 
first show that our approach can model the time-varying correlation and smoothness 
of outputs on 2D synthetic datasets in three scenarios with respect to different types of 
frequencies but the same missing data mechanism. Then, we compare the imputation 
performance on missing data with other inducing-variable based sparse multivariate 
Gaussian process models on a real dataset. 
We conduct experiments on three synthetic time series with low frequency
(LF), high frequency (HF) and varying frequency (VF) respectively. They are
generated from the system of equations

\[ y_1(t) = 5\cos(2\pi w t^{s}) + \epsilon_1(t), \qquad y_2(t) = 5(1 - t)\cos(2\pi w t^{s}) - 5t\cos(2\pi w t^{s}) + \epsilon_2(t), \]

where $\{\epsilon_d(t)\}_{d=1}^{2}$ are independent standard white noise processes. The value of
w refers to the frequency and the value of s characterizes the smoothness. The LF
and HF datasets use the same s = 1, implying the smoothness is invariant across
time. But they employ different frequencies, w = 2 for LF and w = 5 for HF (i.e.,
two periods and five periods in a unit time interval respectively). The VF dataset
takes s = 2 and w = 5, so that the frequency of the function gradually increases as
time increases. For all three datasets, the system shows that as time t increases
from 0 to 1, the correlation between $y_1(t)$ and $y_2(t)$ gradually varies from positive
to negative. Within each dataset, we randomly select 200 training data points, of
which 100 time stamps are sampled on the interval (0, 0.8) for the first dimension
and the other 100 time stamps are sampled on the interval (0.2, 1) for the second
dimension. For the test inputs, we randomly select 100 time stamps on the interval
(0, 1) for each dimension.
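The data-generating system above can be simulated as in the following sketch. The function names, the uniform sampling of time stamps, and the seed are assumptions made for illustration; only the two signal equations and the sampling intervals come from the text.

```python
import numpy as np

def clean_signal(t, w, s):
    """Noise-free outputs (y1, y2) of the two-dimensional system at times t."""
    base = np.cos(2 * np.pi * w * t**s)
    return 5 * base, 5 * (1 - t) * base - 5 * t * base

def simulate(w, s, rng, n_train=100, n_test=100):
    """Draw training and test sets with standard white noise added."""
    t1 = rng.uniform(0.0, 0.8, n_train)   # training time stamps, first output
    t2 = rng.uniform(0.2, 1.0, n_train)   # training time stamps, second output
    tt1 = rng.uniform(0.0, 1.0, n_test)   # test time stamps, first output
    tt2 = rng.uniform(0.0, 1.0, n_test)   # test time stamps, second output
    noise = lambda n: rng.standard_normal(n)
    return {
        "train": (t1, clean_signal(t1, w, s)[0] + noise(n_train),
                  t2, clean_signal(t2, w, s)[1] + noise(n_train)),
        "test": (tt1, clean_signal(tt1, w, s)[0] + noise(n_test),
                 tt2, clean_signal(tt2, w, s)[1] + noise(n_test)),
    }

rng = np.random.default_rng(0)
datasets = {
    "LF": simulate(w=2, s=1, rng=rng),
    "HF": simulate(w=5, s=1, rng=rng),
    "VF": simulate(w=5, s=2, rng=rng),
}
```

At t = 0 the two noise-free outputs coincide, and at t = 1 they have opposite signs, which is the positive-to-negative change in correlation described in the text.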


Table 1 Prediction measurements on three synthetic datasets and different models. LF, HF and VF 
refer to low-frequency, high-frequency, and time-varying datasets. Three prediction measures are 
root mean square error (RMSE), average length of confidence interval (ALCI), and coverage rate 
(CR). All three measurements are summarized by the mean and standard deviation across 10 runs 
with different random initializations. 


Data | Model      | RMSE           | ALCI           | CR
LF   | IGPR [16]  | 2.25(1.33e-13) | 2.18(1.88e-13) | 0.835(0)
LF   | ICM [17]   | 2.26(2.54e-5)  | 2.18(1.22e-5)  | 0.835(0)
LF   | CMOGP [12] | 1.43(6.12e-2)  | 1.36(1.98e-1)  | 0.651(3.00e-2)
LF   | VGPRN [18] | 1.01(0.31)     | -              | -
LF   | VSGPRN     | 1.00(1.43e-1)  | 2.21(6.56e-2)  | 0.892(1.63e-2)
HF   | IGPR [16]  | 1.51(6.01e-14) | 3.17(1.30e-13) | 0.915(2.22e-16)
HF   | ICM [17]   | 1.52(1.01e-5)  | 3.17(1.19e-5)  | 0.910(0)
HF   | CMOGP [12] | 1.29(3.04e-2)  | 2.34(3.31e-1)  | 0.729(3.07e-2)
HF   | VGPRN [18] | 1.11(0.25)     | -              | -
HF   | VSGPRN     | 1.10(1.98e-1)  | 2.74(7.94e-2)  | 0.930(1.14e-2)
VF   | IGPR [16]  | 1.64(8.17e-14) | 3.19(3.02e-13) | 0.875(0)
VF   | ICM [17]   | 1.66(2.37e-3)  | 3.16(1.49e-3)  | 0.880(1.50e-3)
VF   | CMOGP [12] | 2.24(3.08e-1)  | 2.56(9.29e-1)  | 0.697(1.56e-1)
VF   | VGPRN [18] | 1.04(0.67)     | -              | -
VF   | VSGPRN     | 1.24(1.33e-1)  | 2.92(1.21e-1)  | 0.887(9.80e-3)


We quantify the model performance in terms of root mean square error (RMSE),
average length of confidence interval (ALCI), and coverage rate (CR) on the test set.
A smaller RMSE corresponds to better predictive performance of the model, and
a smaller ALCI implies a smaller predictive uncertainty. As for CR, the better the
model prediction performance is, the closer CR is to the nominal level (percentile)
of the credible band. These results are reported as the mean and standard deviation
over 10 different random initializations of model parameters. Quantitative comparisons
for all three datasets are in Table 1. We compare with independent Gaussian process
regression (IGPR) [16], the intrinsic coregionalization model (ICM) [17], Collab-
orative Multi-Output Gaussian Processes (CMOGP) [12] and variational inference
of Gaussian process regression networks (VGPRN) [18] on the three synthetic datasets.
In both the CMOGP and VSGPRN approaches, we use 20 inducing variables.

We further examined model predictive performance on a real-world dataset, the
PM2.5 dataset from the UCI Machine Learning Repository [19]. This dataset tracks
the hourly concentration of fine inhalable particles in five cities in China, along with
meteorological data, from Jan 1st, 2010 to Dec 31st, 2015. We compare our model
with two sparse Gaussian process models, i.e., independent sparse Gaussian process
regression (ISGPR) [20] and the sparse linear model of coregionalization (SLMC) [17].
In the dataset, we consider six important attributes and use 20% of the first 5000
standardized multivariate observations for training and the others for testing. The
RMSEs on the testing data are shown in Table 2, illustrating that VSGPRN had better
prediction performance compared with ISGPR and SLMC, even when using fewer
inducing points.


Table 2 Empirical results for PM2.5 dataset. Each model's performance is summarized by its 
RMSE on the testing data. The number of equi-spaced inducing points is given in parentheses. 
Data  | ISGPR (100) [20] | SLMC (100) [17] | VSGPRN (50) | VSGPRN (100) | VSGPRN (200)
PM2.5 | 0.994            | 0.948           | 0.840       | 0.708        | 0.625
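For reference, the three measures used in this section can be computed as in the sketch below. It assumes that each model returns point predictions together with lower and upper bounds of a predictive interval; the function names and the toy numbers are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error of the point predictions."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def alci(lower, upper):
    """Average length of the predictive intervals."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))

def coverage_rate(y_true, lower, upper):
    """Fraction of test points falling inside their predictive interval."""
    y_true = np.asarray(y_true)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(np.mean(inside))

# Toy values, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
lower, upper = y_pred - 1.0, y_pred + 1.0
scores = (rmse(y_true, y_pred), alci(lower, upper), coverage_rate(y_true, lower, upper))
```

A short interval is only desirable when the coverage rate stays near the nominal level, which is why ALCI and CR are read together.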


5 Conclusions 


We propose a novel variational inference approach for structured Gaussian process
regression networks named the variational structured Gaussian process regression
network, VSGPRN. We introduce inducing variables and propose a structured varia-
tional distribution to reduce the computational burden. Moreover, we take advantage
of the collapsed representation of our model and construct a tractable lower bound of
the log likelihood, which makes it suitable for doubly stochastic inference and makes
missing data easy to handle. In our method, the computational complexity is independent
of the size of the inputs and the outputs. We illustrate the superior predictive performance
on both synthetic and real data.

Our inference approach, VSGPRN, can be used for high dimensional time
series to model complicated time-varying dependence across multivariate outputs.
Moreover, due to its scalability and flexibility, it can be applied to the irregularly
sampled, incomplete, large datasets that exist in various research fields
including healthcare, environmental science and geoscience.
An Online Minorization-Maximization
Algorithm 


Hien Duy Nguyen, Florence Forbes, Gersende Fort, and Olivier Cappé 


Abstract Modern statistical and machine learning settings often involve high data 
volume and data streaming, which require the development of online estimation 
algorithms. The online Expectation-Maximization (EM) algorithm extends the pop- 
ular EM algorithm to this setting, via a stochastic approximation approach. We show 
that an online version of the Minorization-Maximization (MM) algorithm, which in- 
cludes the online EM algorithm as a special case, can also be constructed in a similar 
manner. We demonstrate our approach via an application to the logistic regression 
problem and compare it to existing methods. 


Keywords: expectation-maximization, minorization-maximization, parameter esti- 
mation, online algorithms, stochastic approximation 


1 Introduction 


Expectation-Maximization (EM) [6, 17] and Minorization-Maximization (MM) 
algorithms [15] are important classes of optimization procedures that allow for 
the construction of estimation routines for many data analytic models, including 
many finite mixture models. The benefit of such algorithms comes from the use of
computationally simple surrogates in place of difficult optimization objectives. 

Driven by the high volume of data and the streamed nature of data acquisition, there
has been a rapid development of online and mini-batch algorithms that can be used
to estimate models without requiring data to be accessed all at once. Online and
mini-batch versions of EM algorithms can be constructed via the classic Stochastic
Approximation framework (see, e.g., [2, 13]) and examples of such algorithms
include those of [3, 7, 8, 10, 11, 12, 19]. Via numerical assessments, many of the
algorithms above have been demonstrated to be effective in mixture model estimation
problems. Online and mini-batch versions of MM algorithms, on the other hand,
have largely been constructed following convex optimization methods (see, e.g.,
[9, 14, 23]) and examples of such algorithms include those of [4, 16, 18, 22].

In this work, we provide a stochastic approximation construction of an online 
MM algorithm using the framework of [3]. The main advantage of our approach is 
that we do not make convexity assumptions and instead replace them with oracle 
assumptions regarding the surrogates. Compared to the online EM algorithm of [3] 
that this work is based upon, the Online MM algorithm extends the approach to allow 
for surrogate functions that do not require latent variable stochastic representations, 
which is especially useful for constructing estimation algorithms for mixture of 
experts (MoE) models (see, e.g. [20]). We demonstrate the Online MM algorithm 
via an application to the MoE-related logistic regression problem and compare it to 
competing methods. 

Notation. By convention, vectors are column vectors. For a matrix A, $A^\top$ denotes
its transpose. The Euclidean scalar product is denoted by $\langle a, b \rangle$. For a continuously
differentiable function $\theta \mapsto h(\theta)$ (resp. twice continuously differentiable), $\nabla_\theta h$ (or
simply $\nabla$ when there is no confusion) is its gradient (resp. $\nabla^2_{\theta\theta} h$ is its Hessian). We
denote the vectorization operator that converts matrices to column vectors by vec.


2 The Online MM Algorithm 


Consider the optimization problem

\[ \operatorname*{arg\,max}_{\theta \in \mathbb{T}} \mathbb{E}[f(\theta; X)], \tag{1} \]

where $\mathbb{T}$ is a measurable open subset of $\mathbb{R}^p$, $\mathbb{X}$ is a topological space endowed
with its Borel sigma-field, $f : \mathbb{T} \times \mathbb{X} \to \mathbb{R}$ is a measurable function and X is a
$\mathbb{X}$-valued random variable on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. In this paper, we are
interested in the setting when the expectation $\mathbb{E}[f(\theta; X)]$ has no closed form, and
the optimization problem is solved by an MM-based algorithm.

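Following the terminology of [15], we say that $g : \mathbb{T} \times \mathbb{X} \times \mathbb{T} \to \mathbb{R}$, $(\theta, x, \tau) \mapsto g(\theta, x; \tau)$,
is a minorizer of f, if for any $\tau \in \mathbb{T}$ and for any $(\theta, x) \in \mathbb{T} \times \mathbb{X}$, it holds that

\[ f(\theta; x) - f(\tau; x) \ge g(\theta, x; \tau) - g(\tau, x; \tau). \tag{2} \]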
In our work, we consider the case when the minorizer function g has the following
structure:


A1 The minorizer surrogate g is of the form:

\[ g(\theta, x; \tau) := -\psi(\theta) + \langle \bar{S}(\tau; x), \phi(\theta) \rangle, \tag{3} \]

where $\psi : \mathbb{T} \to \mathbb{R}$, $\phi : \mathbb{T} \to \mathbb{R}^d$ and $\bar{S} : \mathbb{T} \times \mathbb{X} \to \mathbb{R}^d$ are measurable functions.
In addition, $\phi$ and $\psi$ are continuously differentiable on $\mathbb{T}$.


We also make the following assumptions: 


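A2 There exists a measurable open and convex set $\mathbb{S} \subseteq \mathbb{R}^d$ such that for any $s \in \mathbb{S}$,
$\gamma \in [0, 1)$ and any $(\tau, x) \in \mathbb{T} \times \mathbb{X}$:

\[ s + \gamma \{\bar{S}(\tau; x) - s\} \in \mathbb{S}. \]

A3 The expectation $\mathbb{E}[\bar{S}(\theta; X)]$ exists, is in $\mathbb{S}$, and is finite whatever $\theta \in \mathbb{T}$ but it
may have no closed form. Online independent oracles $\{X_n, n \ge 0\}$, with the same
distribution as X, are available.

A4 For any $s \in \mathbb{S}$, there exists a unique root to $\theta \mapsto -\nabla\psi(\theta) + \nabla\phi(\theta)^\top s$, which
is the unique maximum on $\mathbb{T}$ of the function $\theta \mapsto -\psi(\theta) + \langle s, \phi(\theta) \rangle$. This root is
denoted by $\bar{\theta}(s)$.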


Seen as a function of $\theta$, $g(\cdot, x; \tau)$ is the sum of two functions: $-\psi$ and a linear
combination of the components of $\phi = (\phi_1, \dots, \phi_d)$. Assumption A1 implies that
the minorizer surrogate is in a functional space spanned by these (d + 1) functions.
By (2) and A1-A3, it follows that

\[ \mathbb{E}[f(\theta; X)] - \mathbb{E}[f(\tau; X)] \ge \psi(\tau) - \psi(\theta) + \langle \mathbb{E}[\bar{S}(\tau; X)], \phi(\theta) - \phi(\tau) \rangle, \tag{4} \]

thus providing a minorizer function for the objective function $\theta \mapsto \mathbb{E}[f(\theta; X)]$.
By A4, the usual MM algorithm would define iteratively the sequence $\theta_{n+1} =
\bar{\theta}(\mathbb{E}[\bar{S}(\theta_n; X)])$. Since the expectation may not have closed form but infinite datasets
are available (see A3), we propose a novel Online MM algorithm. It defines the
sequence $\{s_n, n \ge 0\}$ as follows: given positive step sizes $\{\gamma_{n+1}, n \ge 0\}$ in (0, 1) and
an initial value $s_0 \in \mathbb{S}$, set for $n \ge 0$:

\[ s_{n+1} = s_n + \gamma_{n+1} \{\bar{S}(\bar{\theta}(s_n); X_{n+1}) - s_n\}. \tag{5} \]


The update mechanism (5) is a Stochastic Approximation iteration, which defines
an $\mathbb{S}$-valued sequence (see A2). It consists of the construction of a sequence of
minorizer functions through the definition of their parameter $s_n$ in the functional
space spanned by $-\psi, \phi_1, \dots, \phi_d$.
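The recursion (5) can be sketched generically as follows. The callables `s_bar`, `theta_bar`, and `step_size` are placeholders supplied by the user (hypothetical names, not from the paper). The toy usage takes $f(\theta; x) = -(\theta - x)^2/2$, for which $\psi(\theta) = \theta^2/2$, $\phi(\theta) = \theta$, $\bar{S}(\tau; x) = x$ and $\bar{\theta}(s) = s$, so that Online MM reduces to a running average.

```python
import numpy as np

def online_mm(s0, s_bar, theta_bar, stream, step_size):
    """Run s_{n+1} = s_n + gamma_{n+1} * (S_bar(theta_bar(s_n); X_{n+1}) - s_n)."""
    s = np.asarray(s0, dtype=float)
    for n, x in enumerate(stream):
        gamma = step_size(n + 1)                 # gamma_{n+1}, assumed to lie in (0, 1)
        s = s + gamma * (s_bar(theta_bar(s), x) - s)
    return s, theta_bar(s)

# Toy problem (assumption): quadratic objective, the surrogate is exact.
data = np.array([1.0, 2.0, 4.0, 5.0])
s_final, theta_final = online_mm(
    s0=0.0,
    s_bar=lambda tau, x: x,
    theta_bar=lambda s: s,
    stream=data,
    step_size=lambda n: 1.0 / (n + 1),
)
```

With the step size 1/(n + 1), the final value is the average of the initial value and the four observations, which makes the stochastic approximation reading of (5) easy to verify by hand.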

If our algorithm (5) converges, any limiting point $s_\star$ satisfies $\mathbb{E}[\bar{S}(\bar{\theta}(s_\star); X)] =
s_\star$. Hence, our algorithm is designed to approximate the intractable expectation,
evaluated at $\bar{\theta}(s_\star)$, where $s_\star$ satisfies a fixed point equation. The following lemma
establishes the relation between the limiting points of (5) and the optimization prob-
lem (1) at hand. Namely, it implies that any limiting value $s_\star$ provides a stationary
point $\theta_\star := \bar{\theta}(s_\star)$ of the objective function $\mathbb{E}[f(\theta; X)]$ (i.e., $\theta_\star$ is a root of the
derivative of the objective function). The proof follows the technique of [3]. Set

\[ h(s) := \mathbb{E}[\bar{S}(\bar{\theta}(s); X)] - s, \qquad \Gamma := \{s \in \mathbb{S} : h(s) = 0\}. \]


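Lemma 1 Assume that $\theta \mapsto \mathbb{E}[f(\theta; X)]$ is continuously differentiable on $\mathbb{T}$ and
denote by $\mathcal{L}$ the set of its stationary points. If $s_\star \in \Gamma$, then $\bar{\theta}(s_\star) \in \mathcal{L}$. Conversely,
if $\theta_\star \in \mathcal{L}$, then $s_\star := \mathbb{E}[\bar{S}(\theta_\star; X)] \in \Gamma$.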


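Proof A4 implies that

\[ -\nabla\psi(\bar{\theta}(s)) + \nabla\phi(\bar{\theta}(s))^\top s = 0, \qquad s \in \mathbb{S}. \tag{6} \]

Use (2) and A1, and apply the expectation w.r.t. X (under A3). This yields (4),
which is available for any $\theta, \tau \in \mathbb{T}$. This inequality provides a minorizer function for
$\theta \mapsto \mathbb{E}[f(\theta; X)]$: the difference is nonnegative and minimal (i.e. equal to zero) at
$\theta = \tau$. Under the assumptions and A1, this yields

\[ \nabla \mathbb{E}[f(\cdot; X)]\big|_{\cdot = \tau} + \nabla\psi(\tau) - \nabla\phi(\tau)^\top \mathbb{E}[\bar{S}(\tau; X)] = 0. \tag{7} \]

Let $s_\star \in \Gamma$ and apply (7) with $\tau \leftarrow \bar{\theta}(s_\star)$. It then follows that

\[ \nabla \mathbb{E}[f(\cdot; X)]\big|_{\cdot = \bar{\theta}(s_\star)} + \nabla\psi(\bar{\theta}(s_\star)) - \nabla\phi(\bar{\theta}(s_\star))^\top s_\star = 0, \]

which implies $\bar{\theta}(s_\star) \in \mathcal{L}$ by (6). Conversely, if $\theta_\star \in \mathcal{L}$, then by (7), we have

\[ \nabla\psi(\theta_\star) - \nabla\phi(\theta_\star)^\top \mathbb{E}[\bar{S}(\theta_\star; X)] = 0, \]

which, by A3 and A4, implies that $\theta_\star = \bar{\theta}(\mathbb{E}[\bar{S}(\theta_\star; X)]) = \bar{\theta}(s_\star)$. By definition of
$s_\star$, this yields $s_\star = \mathbb{E}[\bar{S}(\bar{\theta}(s_\star); X)]$; i.e. $s_\star \in \Gamma$.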


By applying the results of [5] regarding the asymptotic convergence of Stochastic
Approximation algorithms, additional regularity assumptions on $\phi$, $\psi$, $\bar{\theta}$ imply that
the algorithm (5) possesses a continuously differentiable Lyapunov function V de-
fined on $\mathbb{S}$ and given by $V : s \mapsto \mathbb{E}[f(\bar{\theta}(s); X)]$, satisfying $\langle \nabla V(s), h(s) \rangle \ge 0$,
where the inequality is strict outside the set $\Gamma$ (see [3, Prop. 2]). In addition to
Lemma 1, assumptions on the distribution of X and on the stability of the sequence
$\{s_n, n \ge 0\}$ are provided in [5, Thm. 2 and Lem. 1], which, combined with the usual
conditions on the step sizes, $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^2 < \infty$, yield the almost-sure
convergence of the sequence $\{s_n, n \ge 0\}$ to the set $\Gamma$, and the almost-sure conver-
gence of the sequence $\{\bar{\theta}(s_n), n \ge 0\}$ to the set $\mathcal{L}$ of the stationary points of the
objective function $\theta \mapsto \mathbb{E}[f(\theta; X)]$. Due to the limited space, the exact statement
of these convergence results for our Online MM framework is omitted.
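As a worked illustration of the step-size conditions, consider a polynomial schedule $\gamma_n = n^{-\alpha}$ (this particular family is an illustrative assumption). By comparison with the integral of $t^{-\alpha}$,

\[ \sum_{n \ge 1} n^{-\alpha} = +\infty \iff \alpha \le 1, \qquad \sum_{n \ge 1} n^{-2\alpha} < \infty \iff \alpha > \tfrac{1}{2}, \]

so both conditions hold exactly when $\alpha \in (1/2, 1]$. Exponents close to 1/2 give larger steps and faster forgetting of the initial value, at the price of noisier iterates.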
3 Example Application


As an example, we consider the logistic regression problem, where we solve (1) with

\[ f(\theta; x) := y\, w^\top \theta - \log\{1 + \exp(w^\top \theta)\}, \qquad x := (y, w), \]

where $y \in \{0, 1\}$, $w \in \mathbb{R}^p$, and $\theta \in \mathbb{T} := \mathbb{R}^p$. Here, we assume that X = (Y, W) is a
random variable such that $\mathbb{E}[f(\theta; X)]$ exists for each $\theta$.

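Denote by $\lambda$ the standard logistic function $\lambda(\cdot) := \exp\{\cdot\}/(1 + \exp\{\cdot\})$. Following
[1], (2) and A1 are verified by taking

\[ \psi(\theta) := 0, \qquad \phi(\theta) := \begin{bmatrix} \theta \\ \mathrm{vec}(\theta\theta^\top) \end{bmatrix}, \qquad \bar{S}(\tau; x) := \begin{bmatrix} \bar{s}_1(\tau; x) \\ \mathrm{vec}(\bar{S}_2(\tau; x)) \end{bmatrix}, \]

where

\[ \bar{s}_1(\tau; x) := \{y - \lambda(\tau^\top w)\}\, w + \frac{1}{4} w w^\top \tau, \qquad \bar{S}_2(\tau; x) := -\frac{1}{8} w w^\top. \]

With $\mathbb{S} := \{(s_1, \mathrm{vec}(S_2)) : s_1 \in \mathbb{R}^p \text{ and } S_2 \in \mathbb{R}^{p \times p} \text{ is symmetric negative definite}\}$,
so that $\theta \mapsto \langle s, \phi(\theta) \rangle$ has a unique maximizer, it follows that $\bar{\theta}(s) := -(2 S_2)^{-1} s_1$.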
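The quadratic minorizer can be checked numerically. The sketch below assumes the reading $\bar{s}_1(\tau; x) = \{y - \lambda(\tau^\top w)\} w + w w^\top \tau / 4$ and $\bar{S}_2(\tau; x) = -w w^\top / 8$, with $\psi = 0$. The function names are illustrative and the random inputs are arbitrary.

```python
import numpy as np

def logistic(z):
    # Numerically stable form of exp(z) / (1 + exp(z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def f(theta, y, w):
    """Log-likelihood contribution y * w'theta - log(1 + exp(w'theta))."""
    z = w @ theta
    return y * z - np.logaddexp(0.0, z)

def s_bar(tau, y, w):
    """Statistics (s1_bar, S2_bar) of the quadratic minorizer built at tau."""
    s1 = (y - logistic(w @ tau)) * w + 0.25 * np.outer(w, w) @ tau
    S2 = -0.125 * np.outer(w, w)
    return s1, S2

def g(theta, y, w, tau):
    """Surrogate <S_bar(tau; x), phi(theta)> with psi = 0."""
    s1, S2 = s_bar(tau, y, w)
    return s1 @ theta + theta @ S2 @ theta

def theta_bar(s1, S2):
    """Maximizer of s1'theta + theta' S2 theta for negative definite S2."""
    return -np.linalg.solve(2.0 * S2, s1)

rng = np.random.default_rng(0)
gaps = []
for _ in range(200):
    theta, tau, w = rng.normal(size=(3, 2)) * 3.0
    y = float(rng.integers(0, 2))
    # Minorization gap: f(theta) - f(tau) - {g(theta; tau) - g(tau; tau)} >= 0.
    gaps.append(f(theta, y, w) - f(tau, y, w) - (g(theta, y, w, tau) - g(tau, y, w, tau)))
min_gap = min(gaps)
```

The gap is nonnegative because the second derivative of the logistic log-likelihood in $w^\top\theta$ is bounded in absolute value by 1/4, which is the origin of the factors 1/4 and 1/8.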

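Online MM. Let $s_n = (s_{1,n}, S_{2,n}) \in \mathbb{S}$. The corresponding Online MM recursion
is then

\[ s_{1,n+1} = s_{1,n} + \gamma_{n+1} \Big( \{Y_{n+1} - \lambda(\bar{\theta}(s_n)^\top W_{n+1})\} W_{n+1} + \frac{1}{4} W_{n+1} W_{n+1}^\top \bar{\theta}(s_n) - s_{1,n} \Big), \tag{8} \]

\[ S_{2,n+1} = S_{2,n} + \gamma_{n+1} \Big( -\frac{1}{8} W_{n+1} W_{n+1}^\top - S_{2,n} \Big), \tag{9} \]

where $\{(Y_{n+1}, W_{n+1}), n \ge 0\}$ are i.i.d. pairs with the same distribution as X = (Y, W).
Parameter estimates can then be deduced by setting $\theta_{n+1} := \bar{\theta}(s_{n+1})$.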
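A minimal simulation of this recursion is sketched below. Several choices are assumptions made to keep the toy run well conditioned and are not taken from the paper: the sample size, the seed, initializing from the first ten observations rather than two, and attaching the step size $n^{-0.6}$ to the n-th observation.

```python
import numpy as np

def logistic(z):
    # Numerically stable form of exp(z) / (1 + exp(z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def online_mm_logistic(Y, W, theta0, n_init=2, exponent=0.6):
    """Online MM for logistic regression; returns the path of theta_bar(s_n)."""
    W0, Y0 = W[:n_init], Y[:n_init]
    M0 = W0.T @ W0 / n_init
    # s_0: average of S_bar(theta0; X_i) over the first n_init observations.
    s1 = ((Y0 - logistic(W0 @ theta0))[:, None] * W0).mean(axis=0) + 0.25 * M0 @ theta0
    S2 = -0.125 * M0
    theta = -np.linalg.solve(2.0 * S2, s1)
    path = [theta]
    for n in range(n_init, len(Y)):
        y, w = Y[n], W[n]
        gamma = (n + 1) ** (-exponent)    # step size for the (n+1)-th observation
        ww = np.outer(w, w)
        s1 = s1 + gamma * ((y - logistic(w @ theta)) * w + 0.25 * ww @ theta - s1)
        S2 = S2 + gamma * (-0.125 * ww - S2)
        theta = -np.linalg.solve(2.0 * S2, s1)   # theta_bar(s_{n+1})
        path.append(theta)
    return np.array(path)

rng = np.random.default_rng(0)
n_obs = 20_000                               # illustrative sample size (assumption)
theta_true = np.array([3.0, -3.0])
W = np.column_stack([np.ones(n_obs), rng.standard_normal(n_obs)])
Y = (rng.uniform(size=n_obs) < logistic(W @ theta_true)).astype(float)
path = online_mm_logistic(Y, W, theta0=np.zeros(2), n_init=10)
```

Because every update of $S_{2}$ is a convex combination of negative semidefinite matrices, it stays negative definite once the initial value is, so the linear solve defining $\bar{\theta}$ remains well posed.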

For comparison, we also consider two Stochastic Approximation schemes directly
on $\theta$ in the parameter-space: a stochastic gradient (SG) algorithm and a Stochastic
Newton-Raphson (SNR) algorithm.

Stochastic gradient. SG requires the gradient of $f(\theta; x)$ with respect to $\theta$:
$\nabla f(\theta; x) = \{y - \lambda(\theta^\top w)\}\, w$, which leads to the recursion

\[ \hat{\theta}_{n+1} = \hat{\theta}_n + \gamma_{n+1} \{Y_{n+1} - \lambda(\hat{\theta}_n^\top W_{n+1})\} W_{n+1}. \tag{10} \]


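Stochastic Newton-Raphson. In addition, SNR requires the Hessian with respect
to $\theta$, given by $\nabla^2_{\theta\theta} f(\theta; x) = -\lambda(\theta^\top w)\{1 - \lambda(\theta^\top w)\}\, w w^\top$. The SNR recursion is then

\[ \hat{A}_{n+1} = \hat{A}_n + \gamma_{n+1} \{\nabla^2_{\theta\theta} f(\hat{\theta}_n; X_{n+1}) - \hat{A}_n\}, \tag{11} \]
\[ \hat{G}_{n+1} = -\hat{A}_{n+1}^{-1}, \tag{12} \]
\[ \hat{\theta}_{n+1} = \hat{\theta}_n + \gamma_{n+1} \hat{G}_{n+1} \{Y_{n+1} - \lambda(\hat{\theta}_n^\top W_{n+1})\} W_{n+1}. \tag{13} \]

Equation (12) assumes that $\hat{A}_{n+1}$ is invertible. In this logistic example, we can
guarantee this by choosing $\hat{A}_0$ to be invertible. Otherwise $\hat{A}_n$ is invertible after
some n sufficiently large, with probability one. Again in the logistic case, observe
that, from the structure of $\nabla^2_{\theta\theta} f$ and from the Woodbury matrix identity, Equations
(11-12) can be replaced by

\[ \hat{G}_{n+1} = \frac{\hat{G}_n}{1 - \gamma_{n+1}} - \frac{\gamma_{n+1}\, a_{n+1}\, \hat{G}_n W_{n+1} W_{n+1}^\top \hat{G}_n}{(1 - \gamma_{n+1}) \big[ (1 - \gamma_{n+1}) + \gamma_{n+1}\, a_{n+1}\, W_{n+1}^\top \hat{G}_n W_{n+1} \big]}, \]

where $a_{n+1} := \lambda(\hat{\theta}_n^\top W_{n+1})\{1 - \lambda(\hat{\theta}_n^\top W_{n+1})\}$.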

It appears that the Online MM recursion in the s-space defined by (8) and (9) is
equivalent to the SNR recursion above (i.e., (11)-(13)) when the Hessian $\nabla^2_{\theta\theta} f(\theta; x)$
is replaced by the lower bound $-\frac{1}{4} w w^\top$. This observation holds whenever g is
quadratic in $(\theta - \tau)$.
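This equivalence can be verified numerically. The sketch below runs the Online MM recursion and an SNR recursion in which the Hessian is replaced by $-w w^\top/4$, started from matching initial conditions ($\hat{A}_0 = 2 S_{2,0}$ and $\hat{\theta}_0 = \bar{\theta}(s_0)$). The data, initial values, and step sizes are arbitrary illustrative choices.

```python
import numpy as np

def logistic(z):
    # Numerically stable form of exp(z) / (1 + exp(z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def run_both(Y, W, s1, S2, gammas):
    """Run Online MM and SNR with the Hessian replaced by -ww'/4, side by side."""
    theta_mm = -np.linalg.solve(2.0 * S2, s1)
    theta_snr, A = theta_mm.copy(), 2.0 * S2       # matching initial conditions
    for y, w, gamma in zip(Y, W, gammas):
        ww = np.outer(w, w)
        # Online MM update in the s-space.
        s1 = s1 + gamma * ((y - logistic(w @ theta_mm)) * w + 0.25 * ww @ theta_mm - s1)
        S2 = S2 + gamma * (-0.125 * ww - S2)
        theta_mm = -np.linalg.solve(2.0 * S2, s1)
        # SNR update in the theta-space with the Hessian lower bound.
        A = A + gamma * (-0.25 * ww - A)
        G = -np.linalg.inv(A)
        theta_snr = theta_snr + gamma * G @ ((y - logistic(w @ theta_snr)) * w)
    return theta_mm, theta_snr

rng = np.random.default_rng(1)
n_obs = 200
W = np.column_stack([np.ones(n_obs), rng.standard_normal(n_obs)])
Y = (rng.uniform(size=n_obs) < logistic(W @ np.array([3.0, -3.0]))).astype(float)
gammas = (np.arange(n_obs) + 2.0) ** (-0.6)        # step sizes in (0, 1)
theta_mm, theta_snr = run_both(Y, W, np.array([0.1, -0.2]), -0.125 * np.eye(2), gammas)
```

The two trajectories agree up to rounding because $s_{1,n} = -2 S_{2,n} \theta_n$ holds at every step, so the s-space update can be rewritten as a Newton-type step on $\theta$.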

Polyak averaging. In practice, for Online MM, SG, and SNR recursions, it is
common to consider Polyak averaging [21], starting from some iteration $n_0$, chosen
so as to avoid the initial highly volatile estimates. Set $\theta^A_{n_0} := 0$, and for $n > n_0$,

\[ \theta^A_n = \theta^A_{n-1} + \alpha_{n - n_0} (\theta_n - \theta^A_{n-1}), \tag{14} \]

where $\alpha_n$ is usually set to $\alpha_n := n^{-1}$.
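The averaging step can be sketched as below. The indexing is an assumption: the recursion is read as a running mean of the iterates after $n_0$, with weight $1/(n - n_0)$ at iteration n, so that the zero initial value is forgotten after the first update.

```python
import numpy as np

def polyak_average(thetas, n0):
    """Running average of thetas[n0 + 1:], using weights alpha_k = 1 / k."""
    avg = np.zeros_like(np.asarray(thetas[0], dtype=float))
    out = []
    for n in range(n0 + 1, len(thetas)):
        alpha = 1.0 / (n - n0)
        avg = avg + alpha * (thetas[n] - avg)   # recursive form of the sample mean
        out.append(avg)
    return np.array(out)

# Toy iterates 0, 1, ..., 9 (illustration only).
thetas = np.arange(10.0).reshape(-1, 1)
averaged = polyak_average(thetas, n0=3)
```

With these weights the last averaged value equals the plain mean of the iterates after $n_0$, which is the usual Polyak-Ruppert estimate.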


Numerical illustration. We now demonstrate the performance of the Online
MM algorithm for logistic regression, defined by (5) and the derivations above. To
do so, a sequence $\{X_i = (Y_i, W_i), i \in \{1, \dots, n_{\max}\}\}$ of $n_{\max}$ = 10° i.i.d. replicates
of X = (Y, W) is simulated: W = (1, U), where $U \sim \mathcal{N}(0, 1)$ and $[Y | W = w] \sim
\mathrm{Ber}(\lambda(\theta_0^\top w))$, where $\theta_0 = (3, -3)$. Online MM is run using the learning rate $\gamma_n =
n^{-0.6}$, as suggested in [3]. The algorithm is initialized with $\theta_{(0)} = (0, 0)$ and $s_0 =
\sum_{i=1}^{2} \bar{S}(\theta_{(0)}; X_i)/2$.

For comparison, we also show, in Figure 1, the SG and SNR estimates and their
Polyak averaged values in $\theta$-space. As is usually recommended with Stochastic Ap-
proximation, the first few volatile estimates are discarded. Similarly, for Polyak
averaging, we set $n_0$ = 10°. As expected, we observe that the Online MM and the
SNR recursions are very close but with the SNR showing more variability. Their com-
parison after Polyak averaging shows very close trajectories while the SG trajectory
is clearly different and shows more bias. Final estimates [Polyak averaged estimates]
of $\theta_0$ from the SG, SNR, and Online MM algorithms are respectively: (2.67, −2.66)
[(2.51, −2.48)], (3.03, −3.03) [(2.99, −3.03)], and (3.01, −3.03) [(2.98, −3.02)],
which we can compare to the batch maximum likelihood estimate (3.00, −3.05)
(obtained via the glm function in R). Notice the remarkable closeness between the
Online MM and batch estimates.
Fig. 1 Logistic regression example: the first row shows Online MM (black), SG (blue), and SNR
(red) recursions. The second row shows the respective Polyak averaging recursions. The estimates
of the first (first column) and the second (second column) components of $\theta$ are plotted starting
from n = 10? for readability.


4 Final Remarks 


Remark 1 For a parametric statistical model indexed by $\theta$, let $f(\theta; x)$ be the
log-density of a random variable X with stochastic representation $f(\theta; x) =
\log \int_{\mathbb{Y}} p_\theta(x, y)\, \mu(dy)$, where $p_\theta(x, y)$ is the joint density of (X, Y) with respect
to the positive measure $\mu$ for some latent variable $Y \in \mathbb{Y}$. Then, via [15, Sec. 4.2],
we recover the Online EM algorithm by using the minorizer function g:

\[ g(\theta, x; \tau) := \int_{\mathbb{Y}} \log p_\theta(x, y)\; p_\tau(x, y)\, \exp(-f(\tau; x))\, \mu(dy). \]


Remark 2 Via the minorization approach of [1] (as used in Section 3) and the mixture 
representation from [19], we can construct an Online MM algorithm for MoE models, 
analogous to the MM algorithm of [20]. We shall provide exposition on such an 
algorithm in future work. 
Detecting Differences in Italian Regional Health
Services During Two Covid-19 Waves 


Lucio Palazzo and Riccardo Ievoli 


Abstract During the first two waves of the Covid-19 pandemic, territorial healthcare
systems were severely stressed in many countries. The availability (and complexity)
of data requires proper comparisons for understanding differences in the performance
of health services. We apply a three-step approach to compare the performance of
the Italian healthcare system at the territorial level (NUTS 2 regions), considering daily
time series regarding both intensive care units and ordinary hospitalizations of Covid-19
patients. Changes between the two waves at a regional level emerge from the main
results, making it possible to map the pressure on territorial health services.


Keywords: regional healthcare, time series, multidimensional scaling, cluster anal- 
ysis, trimmed k-means 


1 Introduction 


During the Covid-19 pandemic, the evaluation of similarities and differences between
territorial health services [23] is relevant for decision makers and should guide the
governance of countries [15] through the so-called “waves”. This type of analysis
becomes even more crucial in countries where the National healthcare system is
regionally-based, which is the case of Italy (or Spain) among others. Italy is one of
the countries in Europe most affected by the pandemic, and the
pressure on Regional Health Services (RHS) has been producing dramatic effects
also in the economic [2] and the social [3] spheres. Regional Covid-19-related health
indicators are extremely relevant for monitoring the territorial spread of the pandemic
[21], and for imposing (or relaxing) restrictions in accordance with the level of health risk.

The aim of this work is to exploit the potential of Multidimensional Scaling (MDS)
to detect the main imbalances that occurred in the RHS, observing the hospital admission
dynamics of patients with Covid-19 disease. Both daily time series regarding patients
treated in Intensive Care (IC) units and individuals hospitalized in other hospital
wards are used to evaluate and compare the reaction to healthcare pressure in 21
geographical areas (NUTS 2 Italian regions), considering the first two waves [4] of
the pandemic. Indeed, territorial imbalances in terms of RHS performance [24] should
be driven first by the geographical propagation flows of the virus (first wave). Then,
different reactions to the pandemic shock may be provided by RHSs, and changes in
imbalances can be observed in the second wave.

Our proposal consists of three subsequent steps. Firstly, a matrix of distances
between regional time series is obtained through a dissimilarity metric [29]. Then,
we apply a (weighted) MDS [19, 22] to map similarity patterns in a reduced
space, adding also a weighting scheme considering the number of neighbouring
regions. Finally, we perform a cluster analysis to identify groups according to RHS
performance in the two waves.

The paper is organized as follows: Section 2 describes the methodological ap- 
proach used to compare and cluster time series, while Section 3 introduces data and 
descriptive analysis. Results regarding RHSs are depicted and discussed in Section 
4, while Section 5 concludes with some remarks and possible advances. 


2 Time Series Clustering 


Given a T × n matrix, where T represents the days and n the number of regions, our
methodological approach consists of three subsequent steps:


Step 1. Compute a dissimilarity matrix D based on a given measure;

Step 2. Apply a weighted multidimensional scaling (WMDS) procedure, storing 
the coordinates of the first two components; 

Step 3. Perform cluster analysis on the MDS reduced space to identify groups 
between the n regions. 


In the first step, a dissimilarity measure is computed for each pair of regional time
series. The objective is to obtain a dissimilarity matrix D (with elements $d_{i,j}$) for
estimating synthetic measures of the differences between regions. There are different
alternatives to compare time series; some comprehensive overviews are in [29, 13].

A reasonable choice is the Fourier dissimilarity $d_F(x, y)$, which applies the
n-point Discrete Fourier Transform [1] to two time series, making it possible to compare
the similarity between two time sequences after converting them into a combination of
structural elements, such as trend and/or cycle.
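One common way to build such a dissimilarity is sketched below, as the Euclidean distance between the leading Discrete Fourier Transform coefficients of the two series. The exact definition used by the authors is not stated here, so the truncation rule `n_coef` and the function names are assumptions.

```python
import numpy as np

def fourier_dissimilarity(x, y, n_coef=None):
    """Euclidean distance between the first n_coef DFT coefficients of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if n_coef is None:
        n_coef = x.size // 2 + 1          # keep the non-redundant half by default
    diff = np.fft.fft(x)[:n_coef] - np.fft.fft(y)[:n_coef]
    return float(np.sqrt(np.sum(np.abs(diff) ** 2)))

def dissimilarity_matrix(series, n_coef=None):
    """Pairwise Fourier dissimilarities between the columns of a (T x n) array."""
    series = np.asarray(series, dtype=float)
    n = series.shape[1]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = fourier_dissimilarity(series[:, i], series[:, j], n_coef)
    return D

rng = np.random.default_rng(0)
toy = rng.standard_normal((109, 4))       # T = 109 days, 4 hypothetical regions
D = dissimilarity_matrix(toy)
```

Keeping only the first few coefficients compares the low-frequency content (trend and slow cycles) and discards day-to-day noise. Keeping all of them reproduces the ordinary Euclidean distance up to a constant factor.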
In the second step, we implement a multidimensional scaling [31]. Due to its
flexibility, MDS has been introduced also in time series analysis [25] and recently
applied to different topics [30, 9, 16].

Since our aim is to take into account the degree of proximity between regions, we
also employ a weighted multidimensional scaling technique (wMDS) [17, 14]. The
L2 norm is multiplied by a set of weights $\omega = (w_1, \dots, w_n)$ such that high weights
have a stronger influence on the result than low weights.
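As an illustration, the sketch below implements one possible formulation, weighted classical (Torgerson) scaling, in which the weights enter through the centring and the eigen-decomposition. This is an assumption and not necessarily the wMDS variant of [17, 14]. The weights stand in for hypothetical counts of neighbouring regions.

```python
import numpy as np

def weighted_classical_mds(D, weights, n_components=2):
    """Weighted classical scaling of a dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalise weights to sum to one
    n = D.shape[0]
    J = np.eye(n) - np.outer(np.ones(n), w)           # weighted centring operator
    B = -0.5 * J @ (D**2) @ J.T                       # weighted double centring
    sqrt_w = np.sqrt(w)
    M = sqrt_w[:, None] * B * sqrt_w[None, :]
    eigval, eigvec = np.linalg.eigh(M)
    order = np.argsort(eigval)[::-1][:n_components]   # largest eigenvalues first
    lam = np.clip(eigval[order], 0.0, None)
    return (eigvec[:, order] / sqrt_w[:, None]) * np.sqrt(lam)

rng = np.random.default_rng(0)
points = rng.standard_normal((6, 2))                  # toy planar configuration
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
weights = np.array([1, 2, 3, 1, 2, 4])                # hypothetical neighbour counts
coords = weighted_classical_mds(D, weights)
D_rec = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

In this formulation heavier points pull the origin and the leading axes towards themselves, so they shape the retained components more strongly when only a few dimensions are kept.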

The reduced space generated by MDS can be used as a starting point for subsequent
analyses. In particular, a clustering algorithm can be applied to the MDS coordinates
(of the reduced space) [18]. Several procedures are suitable for performing a
cluster analysis on the wMDS coordinates map. For an overview of modern clustering
techniques in time series, see e.g. [26].

In our case, both the geographical spread of the pandemic and population density
can determine remarkable differences in terms of hospitalization rates [12]. To
mitigate the risk that regional outliers in the data generate spurious clusters,
we employ the trimmed k-means algorithm [8, 11]. A relevant topic in cluster analysis
is the choice of the number of groups k. Our strategy is purely data-driven
and is based on the minimization of the within-cluster variance.
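
The idea behind trimmed k-means can be sketched as follows: at each iteration, the fraction alpha of points lying farthest from their closest center is set aside as outliers before the centers are updated. This is a simplified illustration under stated assumptions (user-supplied initial centers, a fixed trimming fraction `alpha`, hypothetical function names), not the implementation used for the results reported here.

```python
import math


def _sq_dist(p, c):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def trimmed_kmeans(points, centers, alpha=0.1, n_iter=100):
    """Simplified trimmed k-means on a list of coordinate tuples.

    points  : list of tuples (e.g. wMDS coordinates of the regions)
    centers : initial centers (list of tuples); their number fixes k
    alpha   : fraction of observations trimmed as outliers
    Returns (centers, labels, objective); trimmed points get label -1.
    """
    n = len(points)
    n_keep = n - math.floor(alpha * n)  # number of non-trimmed points
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(n_iter):
        # closest center (and its squared distance) for every point
        nearest = []
        for p in points:
            d2 = [_sq_dist(p, c) for c in centers]
            j = min(range(len(centers)), key=d2.__getitem__)
            nearest.append((d2[j], j))
        # keep the n_keep points closest to their center, trim the rest
        order = sorted(range(n), key=lambda i: nearest[i][0])
        kept = set(order[:n_keep])
        new_labels = [nearest[i][1] if i in kept else -1 for i in range(n)]
        if new_labels == labels:
            break
        labels = new_labels
        # update each center as the mean of its non-trimmed members
        for j in range(len(centers)):
            members = [points[i] for i in range(n) if labels[i] == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # within-cluster sum of squares over the non-trimmed points
    objective = sum(
        _sq_dist(points[i], centers[labels[i]]) for i in range(n) if labels[i] != -1
    )
    return centers, labels, objective
```

The returned objective is the within-cluster sum of squares of the retained points, so it can be compared across candidate values of k when choosing the number of groups.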


3 Data and Descriptive Statistics 


Daily regional time series reporting a) the number of patients treated in IC units
and b) the number of patients admitted to the other hospital wards are retrieved
from the official website of the Italian Civil Protection¹. All patients tested positive
for Covid-19 (nasal and oropharyngeal swab). To take into account the
different sizes in terms of inhabitants, both a) and b) are normalized by the
population of each territorial unit (estimated at 2020/01/01). The rates of patients
treated in IC units and of hospitalized (HO) patients in other hospital wards are then
multiplied by 100,000.
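
The normalization described above amounts to a rate per 100,000 inhabitants. A minimal sketch, with hypothetical function names and made-up numbers in the example:

```python
def rate_per_100k(count, population):
    """Number of patients per 100,000 inhabitants."""
    return count / population * 100_000


def normalise_series(daily_counts, population):
    """Apply the normalization to a whole daily series of one territorial unit."""
    return [rate_per_100k(c, population) for c in daily_counts]
```

For instance, 10 patients in a hypothetical territory of 125,000 inhabitants correspond to a rate of 8 per 100,000.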
The whole dataset contains two identified waves² of Covid-19, as follows:


Wave 1 (W1): T = 109 days from February 24 to June 11, 2020 
Wave 2 (W2): T = 109 days from September 14 to December 31, 2020 


The trend in the data may also depend on external factors, such as the implementation of
restrictive measures introduced by the Italian Government [27, 6], which influenced
the observed differences between W1 and W2. Note that a full national
lockdown was in force between March 9th and May 18th 2020.

Figure 1 shows the time series for HO and IC (rows), according to the two waves
of Covid-19 (columns). The anomaly of the small Italian region (Valle d'Aosta)
emerges both in the first (in particular concerning IC) and second waves (also for
HO), while Lombardia, which is the largest and most populous region, dominates
the other territories, especially when considering HO in W1. The upper panels of Figure
1 help to understand the differences between the two waves in terms of admission to
intensive care: while regions with high, medium and low IC rates can be directly
identified by visual inspection of the series during W1, more homogeneity
is observed in W2. Furthermore, with the exception of Valle d'Aosta, the IC rate
always remains below 10 for all considered observations.

Fig. 1 Time series distributions of Italian regions.

1 Source: www.dati-covid.italia.it
2 Refer to [7] for further details.

As for the HO rate (lower panels of Figure 1), Lombardia reaches values
greater than 100 in W1 (especially in April), while during W2 this threshold was
exceeded by Valle d'Aosta and Piemonte (both in November). Again, while W1 contrasts
regions with high and (moderately) low HO rates, in W2 the following situation
arises: a) Valle d'Aosta and Piemonte reach values over 100, b) four regions (Liguria,
Lazio, P.A. Trento and P.A. Bolzano) present values over 75, and c) the majority of
territories share similar trends with peaks always lower than 75.


4 Grouping Regions by Clustering and Discussion 


In order to confirm and deepen the descriptive results of Section 3, we perform a
cluster analysis following the scheme proposed in Section 2. We compute wMDS
equipped with the Fourier distance³, using a set of weights w proportional to the
number of neighbours of each region, which brings a spatial component into the model.
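
One possible reading of these weights is sketched below: each region receives a weight proportional to the number of regions it borders. The adjacency structure, the normalization to unit sum and the function name are assumptions made for illustration only.

```python
def neighbour_weights(adjacency):
    """Weights proportional to the number of neighbours of each region.

    adjacency : dict mapping a region name to the set of regions it borders
    Returns a dict of weights that sum to 1 (the normalization is an assumption).
    """
    degree = {region: len(neigh) for region, neigh in adjacency.items()}
    total = sum(degree.values())
    if total == 0:
        raise ValueError("no region has any neighbour")
    return {region: d / total for region, d in degree.items()}


# Toy example with three hypothetical regions in a row: A - B - C
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
weights = neighbour_weights(toy)
```

Under this reading, a territory without land borders (an island, for example) would receive a zero weight; how such cases are handled is not specified here.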

Figure 2 displays the main results of wMDS, distinguishing between four levels
of critical issues experienced by the RHS. Outlying performances are coloured in
Violet. A first cluster (in Red) includes "critical" regions, while a group depicted in
Orange contains territories with high pressure on their RHS. Regions in the
Green cluster experienced moderate pressure on their RHS, while the colour Blue indicates
territories under low pressure. These clusters may also be interpreted
as a ranking of the health service risk.

As regards HO during W1, leaving aside the two outliers (Lombardia and P.A.
Bolzano), the "red" cluster is composed of three Northern regions (Piemonte, Valle
d'Aosta and Emilia-Romagna). The high-pressure group is composed of Liguria,
Marche and P.A. Trento, while the green cluster involves Lazio, Abruzzo and Toscana
(from the centre of Italy) and Veneto. The last group includes nine regions, seven of which
are located in southern Italy. In W2, the clustering procedure identifies Piemonte and Valle
d'Aosta as outliers, while the high-pressure group is composed of two
autonomous provinces (Trento and Bolzano), Lombardia and Liguria. The "orange"
group consists of regions located in the North-East (Friuli-Venezia Giulia,
Emilia-Romagna and Veneto), along with Abruzzo and Lazio. Southern regions
are allocated to the "green" group (together with Umbria, Toscana and
Marche), while Molise, Calabria and Basilicata remain in the low-pressure cluster.

Regarding IC rates, during W1 Lombardia and Valle d'Aosta are considered
outliers, while the "red" cluster is composed of four northern Italian regions
(Emilia-Romagna, P.A. Trento, Piemonte and Liguria) and Marche (located in
the centre). The "orange" cluster contains Toscana, Veneto and P.A. Bolzano, while
the moderate-pressure cluster involves three areas of central Italy (Lazio and Umbria),
along with Friuli-Venezia Giulia (from the north-east) and Abruzzo. The last
cluster includes only regions from the south. In W2 (bottom right panel of
Figure 2), apart from Valle d'Aosta, the procedure identifies Calabria as an outlier.
The "red" group acquires two observations from the centre of Italy, namely Toscana
and Umbria, while the majority of regions are classified in the moderately pressured
group. Only three Southern Italian areas are allocated to the last group (in green).

While the geography of the disease appears fundamental in W1, especially regarding
the territories adjoining Lombardia, in W2 this effect is less evident. Thus, regions
improving (e.g. Emilia-Romagna) or worsening (such as Lazio and Abruzzo) their
clustering "ranking" can be easily observed. As mentioned, the differences in the
restrictive measures imposed by the Government in the two periods may have played a role
in these results.


3 We remark that other distance measures have been applied. Moreover, a) the Fourier one shows 
better performance in terms of goodness of fit; b) the results are not sensitive with respect to the 
choice of distance.

Fig. 2 Map of the identified regional clusters.


Detecting Differences in Italian RHS During Covid-19 Waves 2779 


5 Concluding Remarks 


The Covid-19 pandemic has put a strain on the Italian healthcare system. The reactions
of RHS play a relevant role in mitigating the health crisis at territorial level and
in guaranteeing equitable access to healthcare.

This work helps to understand similarities and divergences between the Italian regions
in relation to the health pressure of the first two waves of the virus. Considering
crucial measures such as HO and IC rates, the comparison between the two waves allows
us to understand differences in the reactions of RHS to pandemic shocks. Although
northern Italy represented the epicentre of the Covid-19 spread in the first wave,
some regions (e.g. Veneto and Friuli-Venezia Giulia) seem to have succeeded in
avoiding hospital overcrowding, while Southern regions (and Islands) clearly
suffered from less pressure. Furthermore, in the second wave, the difference appears
slightly smoothed and the cluster sizes seem more homogeneous. However, there
are some exceptions, such as Emilia-Romagna, which seems to have been less
affected by the second wave compared to the other regions. The detection of clusters
represents a starting point for the improvement of health governance and can be used
to monitor potential imbalances in future waves.

Further analysis may employ other dedicated indicators coming, for instance,
from the Italian National Institute of Statistics, or use different proposals for combining
wMDS with dissimilarity measures and clustering [28]. Following a different
methodological approach, the recent method proposed in [10] could be applied to
these data to include more complex spatial relationships between territories.


Political and Religion Attitudes in Greece:
Behavioral Discourses


Georgia Panagiotidou and Theodore Chadjipadelis 


Abstract The research presented in this paper attempts to explore the relationship between
religious and political attitudes. More specifically, we investigate how religious
behavior, in terms of belief intensity and practice frequency, is related to specific
patterns of political behavior such as ideology, understanding of democracy and one's set
of moral values. The analysis is based on the use of multivariate methods, more
specifically Hierarchical Cluster Analysis and Multiple Correspondence Analysis, in
two steps. The findings are based on a survey implemented in 2019 on a sample of
506 respondents in the wider area of Thessaloniki, Greece. The aim of the research is
to highlight the role of people's religious practice intensity in shaping their political
views by displaying the profiles resulting from the analysis and linking individual
religious and political characteristics as measured with various variables. The final
output of the analysis is a map where all variable categories are visualized, bringing
forward models of political behavior associated with other factors such
as religion, moral values and democratic attitudes.


Keywords: political behavior, religion, democracy, multivariate methods, data analysis


1 Introduction 


In this research we present the analysis results of a survey, which was conducted
in April 2019 with 506 respondents in Thessaloniki, focusing on their religious profile
as well as their political attitudes, their moral profile and the way they comprehend
democracy. The aim of the analysis is to investigate and highlight the role of religious
practice in shaping political behavior. In the political behavior analysis field, religion,
and more specifically church practice, has emerged as one of the main pillars that form
the political attitudes of voters. Religious habits seem to have a decisive influence
on electoral choices, as follows from Lazarsfeld's research at Columbia University
in 1944 [3], followed by the work of Butler and Stokes in 1969 [1] and the research
of Michelat and Simon in France [6]. More specifically, in the comparative study
of Rose in 1974 [9], it turns out that the more religious voters appear to be more
conservative, choosing to place themselves on the right side of the ideological
"left-right" axis, while the non-religious voters opt for the left political parties.
The research and analysis of Michelat and Simon [6] brings to the surface two
opposing cultural models: on the one hand we have the deeply religious voters, who
belong to the middle and upper classes, residing in the cities or in the countryside,
while on the other hand we have the non-religious left voters with working class
characteristics. The first framework is articulated around religion; those who
belong to it identify themselves as religious people and are inspired by a conservative
value system that puts first the value of the individual, the family, the ancestral heritage
and tradition. The second cultural context is articulated around class rivalries and
socio-economic realities; those who belong to this context identify themselves as
"us workers towards others". They believe in the values of collective action, vote
for left-wing parties, participate actively in unions and defend the interests of the
working class. To measure the influence of religious practice on political behavior,
applied research uses measurement scales of the intensity of religious beliefs and
the frequency of church service practice as an indicator of the level of one's religious
integration.

To measure the level of religious intensity, variables are used such as how often respondents go
to the service and how much they believe in the existence of God, in the afterlife, in the
dogmas of the church and so on. Since the 1990s there has been a rapid decline in the frequency
with which the population attends church service or self-identifies strongly in terms of
religiousness. Nevertheless, the correlation between electoral preference and
religious practice remains strong [5]. The most significant change for non-religious
people is that the left is losing its universal influence, as many of these voters also
move to the center. Strongly religious people continue to support the right more and, in
some cases, strengthen the far right. In this paper, apart from attempting to explore
and verify the existing literature on the effect of religion on political behavior,
focusing on the Greek case, the approach exploits methods used to achieve the
visualization of all existing relationships between different sets of variables. To link
together numerous variables and their categories and construct a model of religious and
political behavior, multiple applications of Hierarchical Cluster Analysis (HCA) are
carried out, followed by Multiple Correspondence Analysis (MCA) for the emerging
clusters. In this way, a semantic map is constructed [7], which visualizes discourses
of political and religious behavior and the inner antagonisms between the behavioral
profiles.


2 Methodology


For the implementation of the research, a poll was conducted on a random sample
of 506 people in the greater area of Thessaloniki in Greece, during April 2019.
A questionnaire was used as a research tool, distributed through an on-site
approach to the random respondents. The questionnaire consisted of three sections:
a) the first section included seven questions on demographic data of the respondent,
such as gender, age, educational level, marital status, household income, occupation
and the social class to which the respondent considers belonging; b) the second part
contained seven questions, ordinal variables, related to the religious practice and
beliefs of the respondent: i) how often does one go to church? ii) how often does one
pray? iii) how close does one feel to God, Virgin Mary (or to another seven religious
concepts) during church service? iv) how strongly does one have seven different
feelings during church service? v) does one believe or not in the saints, miracles,
prophecies (and another six religious concepts)? Two more questions investigating
the respondents' profile in terms of what is taught in the Christian dogma were included: vi)
one asking if one can progress only by being an ethical person and vii) another one
asking if they agree with the pain/righteousness scheme, that is, whether one who suffers in this
life will be rewarded later or in the afterlife; c) questions concerning the political
profile of the respondent are developed in the third part of the questionnaire: i)
one's self-positioning on the ideological left-right axis; ii) a set of nine ordinal
variables requiring one's level of agreement or disagreement with sentences that reflect
the dimensions of liberalism-authoritarianism and left-right; iii) this last section
also includes two different sets of pictures, used as symbolic representations of the
"democratic self" and the "moral self" [4]. The first set of twelve pictures represents
various conceptualizations of democracy, and one is asked to select three pictures
that represent democracy. The second set of pictures represents moral values in
life, and one is asked to choose three pictures that represent one's set of personal
values. Variables are ordinal, using a five-point Likert scale, apart from the question
regarding whether one believes or not in prophecies, magic, etc. and the last two
questions with the pictures, where we use a binary scale of yes-no or zero-one,
where zero stands for a non-selected picture and one for a selected picture.

Data analysis was implemented with the M.A.D software (Méthodes
d'Analyse des Données), developed by Professor Dimitris Karapistolis (more about
M.A.D software at www.pylimad.gr). Firstly, Hierarchical Cluster Analysis (HCA),
using the chi-square distance and Ward's linkage, assigns subjects to distinct groups
based on their response patterns. This first step produces a cluster membership
variable, assigning each subject to a group. In addition, the behavior typology of
each group is examined by testing the connection of each variable level to each cluster
with a two-proportion z-test (significance level set at 0.05) between respondents
belonging to cluster i and those who do not belong to cluster i for a variable level. The
number of clusters is determined using the empirical criterion of the change in the
ratio of between-cluster inertia to total inertia when moving from a partition with r
clusters to a partition with r − 1 clusters [8]. In the second step of the analysis, the
cluster membership variable is analyzed together with the existing variables using
MCA on the Burt table [2]. All associations among the variable categories are given
on a set of orthogonal axes, with the least possible loss of the information
of the original Burt table. Next, we apply HCA to the coordinates of the variable
categories on the total number of dimensions of the reduced space resulting from the
MCA. In this way we cluster the variables, just as we previously clustered the subjects.
By clustering the variable response categories, we detect the various discourses of
behavior, where each cluster of categories stands as a behavioral profile linked with
a set of responses and characteristics. To produce the final output, the semantic map,
we created a table including the output variables of the questionnaire, including
demographics and variables for political behavior. Using the same two-step procedure
with HCA and MCA for this final table, the semantic map is constructed, positioning
the variable categories on a biplot created by the first two dimensions of MCA.
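
The two-proportion z-test used to link a variable level to a cluster can be sketched as follows. The sketch assumes the standard pooled-variance, two-sided form of the test; the text does not state which variant the M.A.D software applies, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled variance.

    x1, n1 : number of respondents giving the response level, and group size,
             among respondents inside cluster i
    x2, n2 : the same counts among respondents outside cluster i
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("test undefined: pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    # two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A response level would then be considered connected to cluster i when the p-value falls below the 0.05 significance level mentioned above.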


3 Results 


In the first step of the analysis, we apply HCA to each set of variables in each
question. For the question: "How close do you feel during the service 1-To God, 2-To
the Virgin, 3-To Christ, 4-To some Saint, Angel, 5-To the other churchgoers, 6-To
Paradise, 7-To Hell, 8-To the divine service, 9-To the preaching priest", we get four
clusters (Figure 1).


Cluster | Responses related to the cluster | %

e19837 | "not at all" in everything | 7.9%
e19882 | "enough" in 1,2,3 / "little" or "not at all" in 5,6,9 | 55.1%
e19883 | "a little" in 1,2,3 / "not at all" in 4,5,6,8,9 | 19.5%
e19884 | "absolutely" in everything and "enough" in 5,6,9 | 17.5%


Fig. 1 Four clusters on how close the respondents feel during church service. 


For the question: "How strongly do you feel after the end of the service 1-The Grace
of God in me, 2-Power of the soul, 3-Forgiveness for those who have hurt me,
4-Forgiveness for my sins, 5-Peace, 6-Relief it is over", we get six clusters (Figure 2).


Cluster | Responses related to the cluster | %

e21902 | in everything "absolutely" | 9.0%
e21904 | "absolutely" peace, strength of soul / "not at all" forgiveness, relief | 23.4%
e21905 | in all "absolutely" / "not at all" relief | 11.8%
e21906 | "quite" relief / in all others "a little" | 16.8%
e21907 | in everything "not at all" | 5.9%
e21908 | in all "enough" | 33.0%


Fig. 2 Six clusters on how the respondents feel at the end of church service.

Five clusters are found (Figure 3) for the question: "Do you believe in 1-Bad (magic
influence)? 2-Magic? 3-Destiny? 4-Miracles? 5-Prophecies of the Saints? 6-Do you have
pictures of holy figures in your house? 7-in your workplace? 8-Do you have a family
Saint?".


Cluster | Responses related to the cluster | %

e22877 | yes to miracles and images | 23.8%
e22872 | yes to miracles, prophecies and pictures | 12.0%
e22874 | not at all | 8.4%
e22875 | yes in bad influence, magic, miracles, prophecies and pictures | 17.4%
e22879 | yes to all | 37.8%


Fig. 3 Five clusters on the beliefs of the respondents on various aspects of the Christian faith. 


Six clusters are detected (Figure 4) for the question: “How do you feel when you 
come face to face with a religious image 1-Peace, 2-Awe, 3-The presence of God, 
4-Emotion, 5-The need to pray, 6-Contact with the person in the picture". 


Cluster | Responses related to the cluster | %

e23856 | in everything "not at all" | 5.1%
e23887 | in all others "moderately" (a little in awe, emotion / enough in prayer) | 16.9%
e23890 | "not at all" in prayer and person in the picture / in everything else "a little" | 9.8%
e23892 | in everything "absolutely" | 15.3%
e23893 | "not at all" in awe / in everything else "a little" | 12.4%
e23894 | in everything "enough" | 40.4%


Fig. 4 Six clusters on how the respondents feel when facing a religious image. 


We proceed with the clustering of the replies on political views and we get seven 
clusters of political profiles (Figure 5). 


Cluster | Responses related to the cluster | %

e29881 | "strongly agrees" with drachma, individualism, anti-immigrant, anti-EU, welfare state, not leader | 7.8%
e29885 | "agrees" with welfare state, "disagrees" with all the rest | 8.2%
e29886 | "agrees" with strong leader, tax cuts | 27.6%
e29887 | "disagrees" with the right to violence, "agrees" with all the rest | 8.9%
e29889 | "agrees" with drachma, individualism, anti-immigrants, welfare state, not leader (difference with 881, here simply "agrees" and not interested in EU) | 14.0%
e29890 | "agrees" with drachma, "disagrees" with all the rest | 11.4%
e29891 | "agrees" with tax cuts, drachma, anti-immigrant, anti-EU, individualism, strong leader | 22.0%


Fig. 5 Seven clusters according to the political views and profile of the respondents.

For the symbolic representation of the democratic self, when choosing three
pictures that represent democracy for the respondent, we find eight clusters (Figure 6),
and eight clusters for the symbolic representation of the moral self of the
respondents, as shown in Figure 7.


Cluster | Responses related to the cluster | %

e31892 | direct democracy, money, revolution, riot | 5.4%
e31893 | parliament, money | 2.4%
e31914 | direct democracy | 11.6%
e31916 | parliament, council, church | 10.9%
e31918 | protest, revolution | 10.7%
e31920 | e-gov | 14.2%
e31921 | protest, council, revolution | 13.3%
e31924 | protest, ancient Greece, parliament, volunteering, church | 31.5%


Fig. 6 Eight clusters on how the respondents understand democracy. 


Cluster | Responses related to the cluster | %

e30970 | Christ, intimacy, volunteering, family | 24.9%
e30953 | fun, intimacy, meditation, win, rebellion | 2.2%
e30958 | Christ, family, army | 13.7%
e30960 | meditation, win | 7.6%
e30961 | fun, career, intimacy, money | 7.4%
e30972 | career, win, fun, career | 17.2%
e30966 | career, peace, family | 9.4%
e30968 | Christ, peace, family | 17.6%


Fig. 7 Eight clusters on the different sets of moral values of the respondents. 


In the second step of the analysis, we jointly process the cluster membership
variables. MCA produces the coefficients of each variable category, which are now
positioned on a two-dimensional map as seen in Figure 9. HCA is then applied again
to the coefficients of the items, which brings forward three main clusters, modeling
political and religious behavior. In Figure 8, cluster 77 is connected to the centre and
moderate religious behavior, cluster 78 reflects the voters of the right, with strong
religious habits and beliefs, individualistic attitudes and more authoritarian and
nationalistic political views, whereas cluster 79 represents the leftist, non-religious
voters, closer to revolutionary political views and collective goods. Examining the
antagonisms on the behavioral map (Figure 9), the first (horizontal) axis, which explains
22.8% of the total inertia, is created by the antithesis between right political ideology
- strong religious behavior and left political ideology - no religious behavior (cluster
78 opposite to cluster 79). The second (vertical) axis accounts for 7% of the inertia
and is explained as the opposition between the center (moderate religious behavior)
and the left and right (cluster 77 opposite to both clusters 78 and 79).

Variable | Cluster 77 | Cluster 78 | Cluster 79
Ethical person | Enough, a little | Absolutely | Not at all
Righteousness | A little / moderately | Enough / Very / Absolutely | Not at all
Ideology | Centre | Right | Left

Fig. 8 Three main behavioral discourses linking all variable categories together.

Fig. 9 The semantic map visualizing the behavioral profiles of voters, and the inner antagonisms.


4 Discussion


The analysis uncovers the strong existing relationship between religious habits and
political views for the Greek case. The semantic map indicates two main antagonistic
cultural discourses, including religious, political and moral characteristics. The
first discourse (cluster 77) is described by moderate religious practice and beliefs,
connected to the ideological center. These voters have political attitudes that belong
to the space between the center-left and the center-right. They understand democracy
as a connection to money, direct democracy and electronic democracy. Their moral
set of values is naturalistic and individualistic. The next behavioral discourse (cluster
78) describes the voters of right ideology, with strong religious beliefs and frequent
religious practice. They appear as very ethical and believe in the concept of pain
and righteousness. Regarding their political attitudes, these more religious voters
are against violence and have more authoritarian and nationalistic positions. They view
democracy as parliamentary, representative, ancient Greece but also as church, while
their moral set of values appears clearly naturalistic, Christian and nationalistic.

Cluster 79 reflects the exact opposite discourse compared to 78. These voters
belong to the left ideology and are non-religious. They do not adopt the ideas of
the ethical person, or the scheme of pain and righteousness as mentioned in the
Christian dogma. In terms of political attitudes, they are pro-welfare state. These
non-religious and left voters understand democracy as direct, with the need for
revolution, protest and riot, and support collective goods. Interpreting further the
antagonisms as visualized on the semantic map, the main competition exists between
the "right political ideology - strong religious behavior - individualism" discourse
and the "left political ideology - no religious behavior - collectivism" discourse. A
secondary opposition is found between the "center ideology - moderate religious
behavior" discourse and the left and right extreme positions.


Supervised Classification via Neural Networks
for Replicated Point Patterns


Katerina Pawlasová, Iva Karafiátová, and Jiří Dvorak 


Abstract A spatial point pattern is a collection of points observed in a bounded
region of R^d, d ≥ 2. Individual points represent, e.g., observed locations of cell
nuclei in a tissue (d = 2) or centers of undesirable air bubbles in industrial materials
(d = 3). The main goal of this paper is to show the possibility of solving the
supervised classification task for point patterns via neural networks with general input
space. To predict the class membership for a newly observed pattern, we compute
an empirical estimate of a selected functional characteristic (e.g., the pair correlation
function). Then, we consider this estimated function to be a functional variable
that enters the input layer of the network. A short simulation example illustrates
the performance of the proposed classifier in the situation where the observed patterns
are generated from two models with different spatial interactions. In addition,
the proposed classifier is compared with convolutional neural networks (with point
patterns represented by binary images) and kernel regression. Kernel regression
classifiers for point patterns have been studied in our previous work, and we consider
them a benchmark in this setting.


1 Introduction


Spatial point processes have recently received increasing attention in a broad range
of scientific disciplines, including biology, statistical physics, or material science
[9]. They are used to model the locations of objects or events randomly occurring
in R^d, d ≥ 2. We distinguish between the stochastic model (point process) and its
realization observed in a bounded observation window (point pattern).

Typically, analyzing spatial point pattern data means working with just one pat- 
tern, which comes from a specific physical measurement. In this paper, we take 
another perspective: we suppose that a collection of patterns, which are independent 
realizations of some underlying stochastic models, is to be analyzed simultaneously. 
These independent realizations are then referred to as replicated point patterns. 
Recently, this type of data has become more frequent, encouraging the adaptation 
of methods such as supervised classification to the point pattern setting. 

Since we are talking about supervised classification, our task is to predict the label
variable (indicating class membership) for a newly observed point pattern, using
the knowledge about a sample collection of patterns with known labels (training
data). In the literature, this problem has been studied to a limited extent. Properties
of a classifier constructed specifically for the situation where the observed patterns
were generated by inhomogeneous Poisson point processes with different intensity
functions are discussed in [5]. However, this method is based on the special properties
of the Poisson point process, and its use is thus limited to a small class of models.
On the other hand, no assumptions about the underlying stochastic models are made
in [12], where the task for replicated point patterns is transformed, with the help
of multidimensional scaling [16], to the classification task in R^2. In [10, 11], the kernel
regression classifier for functional data [4] is adapted for replicated point patterns.
Instead of classifying the patterns themselves, a selected functional characteristic
(e.g. the pair correlation function) is estimated for each pattern. These estimated
values are considered functional observations, and the classification is performed
in the context of functional data. The idea of linking point patterns to functional
data also appears in [12]: the dissimilarity matrix needed for the multidimensional
scaling is based on the same type of dissimilarity measure that is used for the kernel
regression classifier in [10, 11]. Finally, [17] briefly discusses the model-based
supervised classification. Unsupervised classification is explored in [2].

In this paper, our goal is to discuss the use of classifiers based on artificial neu- 
ral networks in the context of replicated point patterns. We pay special attention 
to the procedure described in [14], where both functional and scalar observations 
enter the input layer. Hence, similarly as in [10, 11], each pattern can be represented 
by estimated values of a selected functional characteristic and the classification is per- 
formed in the context of functional data. The resulting decision about class member- 
ship is based on the spatial properties of the observed patterns that can be described 
by the selected characteristic. Therefore, with a thoughtfully chosen characteristic,
this method has great potential within a wide range of possible classification scenarios.
Moreover, it can be used without assuming stationarity of the underlying point
processes, and it can be easily extended to more complicated settings (e.g., point
patterns in non-Euclidean spaces or realizations of random sets). 

We present a short simulation experiment that illustrates the behaviour of the neu- 
ral network described in [14]. Binary classification is performed on realizations 
of two different point process models — the Thomas process (model for attractive 
interactions among pairs of points) and the Poisson point process (benchmark model 
for no interactions among points). This approach is then compared to the classifica- 
tion based on convolutional neural networks (CNNs) [8], where each pattern enters 
the network as a binary image. Finally, both methods based on artificial neural net- 
works are compared to the kernel regression classifier studied in [10, 11] which can 
be considered a benchmark in the context of replicated point patterns. 

This paper is organized as follows. Section 2 provides a brief theoretical back- 
ground on spatial point processes and their functional characteristics, including 
the definition of the pair correlation function, which plays a crucial role in the se- 
quel. Section 3 summarizes the methodology introduced in [14] about neural network 
models with general input space. Section 4 is devoted to a short simulation example. 


2 Point Processes and Point Patterns 


This section presents the necessary definitions from the point process theory. Our ex- 
position closely follows the book [13]. For detailed explanation of the theoretical 
foundations, see, e.g., [7]. Throughout the paper, a simple point process X is defined
as a random locally finite subset of $\mathbb{R}^d$, $d \geq 2$, where each point $x \in X$ corresponds
to a specific object or event occurring at the location $x \in \mathbb{R}^d$. In applications, X can
be used as a mathematical tool to model random locations of cell nuclei in a tissue
(with $d = 2$) or centers of undesirable air bubbles in industrial materials ($d = 3$).
We distinguish between the mathematical model X, which is called a point process, 
and its observed realization X, which is called a point pattern. Examples of four 
different point patterns are given in Figure 1. 

Before proper definition of the pair correlation function, a functional characteristic
that plays a key role in the sequel, we need to define some moment properties of X.
The intensity function $\lambda(\cdot)$ is a non-negative measurable function on $\mathbb{R}^d$ such that
$\lambda(x)\,\mathrm{d}x$ corresponds to the probability of observing a point of X in a neighborhood
of x with an infinitesimally small area dx. If X is stationary (its distribution is
translation invariant in $\mathbb{R}^d$), then $\lambda(\cdot) = \lambda$ is a constant function and the constant $\lambda$ is
called the intensity of X. In this case, $\lambda$ is interpreted as the expected number of points
of X that occur in a set with unit d-dimensional volume. Similarly, the second-order
product density $\lambda^{(2)}(\cdot, \cdot)$ is a non-negative measurable function on $\mathbb{R}^d \times \mathbb{R}^d$ such
that $\lambda^{(2)}(x, y)\,\mathrm{d}x\,\mathrm{d}y$ corresponds to the probability of observing two points of X
that occur jointly at the neighborhoods of x and y with infinitesimally small areas
dx and dy.

Assuming the existence of $\lambda$ and $\lambda^{(2)}$, the pair correlation function $g(x, y)$ is defined
as $\lambda^{(2)}(x, y)/(\lambda(x)\lambda(y))$, for $\lambda(x)\lambda(y) > 0$. If $\lambda(x) = 0$ or $\lambda(y) = 0$, we set
$g(x, y) = 0$. We write $g(x, y) = g(x - y)$ when g is translation invariant and
$g(x, y) = g(\|x - y\|)$ when g is also isotropic (invariant under rotations around
the origin). For the Poisson point process, a model for complete spatial randomness,
$\lambda^{(2)}(x, y) = \lambda(x)\lambda(y)$ and $g \equiv 1$. Thus, $g(x, y)$ quantifies how likely it is to observe
two points in X jointly occurring in infinitesimally small neighbourhoods of x and y,
relative to the "no interactions" benchmark.
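To make the comparison with the "no interactions" benchmark concrete, the following minimal sketch evaluates the well-known closed form of the pair correlation function of a planar Thomas process, $g(r) = 1 + \exp(-r^2/(4\sigma^2))/(4\pi\kappa\sigma^2)$, next to the constant value 1 of the Poisson case. The parent intensity `kappa` and its value 50 are assumptions made only for illustration; they are not taken from the text.

```python
import math


def thomas_pcf(r, sigma, kappa):
    """Theoretical pair correlation function of a planar Thomas process.

    sigma: standard deviation of the offspring displacement,
    kappa: intensity of the parent process (hypothetical value used below).
    """
    return 1.0 + math.exp(-r * r / (4.0 * sigma ** 2)) / (4.0 * math.pi * kappa * sigma ** 2)


def poisson_pcf(r):
    """The Poisson process has no interactions, so g is identically 1."""
    return 1.0


if __name__ == "__main__":
    distances = (0.0, 0.05, 0.1, 0.25)
    for sigma in (0.1, 0.05, 0.02):
        values = [round(thomas_pcf(r, sigma, kappa=50.0), 3) for r in distances]
        print("sigma =", sigma, "g(r) =", values)
```

Smaller values of `sigma` give larger values of g at short distances, while g approaches 1 at distances that are large relative to `sigma`.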

A large variety of characteristics (both functional and numerical) have been de- 
veloped to capture various hypotheses about the stochastic models that generated 
the observed point patterns at hand. We have focused on the pair correlation function 
g mainly because of its widespread use in practical applications and ease of interpre- 
tation. Other popular characteristics are based on g, e.g., its cumulative counterpart, 
traditionally called the K-function. Others are based on inter-point distances, such as
the nearest-neighbor distance distribution function G and the spherical contact dis- 
tribution function F. A comprehensive summary of commonly used characteristics, 
including the list of possible empirical estimators, is presented in [9, 13]. Estimators 
of g, K, G, and F are implemented in the R package spatstat [3]. 


3 Neural Networks with General Input Space 


This section prepares the theoretical background for the supervised classification 
of replicated point patterns via artificial neural networks. The recent approach 
of [14, 15] is the cornerstone of our proposed classifier, and hence we focus on 
its description in the following paragraphs. On the other hand, the approach based 
on CNNs is more established in the literature. We use it primarily for comparison 
and thus we refer the reader to [8] for a detailed description. 

Fig. 1 Theoretical values of the pair correlation function g for the Poisson point process
and the Thomas process with different values of the model parameter $\sigma$. For these models, g is
translation invariant and isotropic. A single realization of the Poisson point process and the Thomas
process with parameter $\sigma$ set to 0.1, 0.05 and 0.02, respectively, is illustrated in the right part
of the figure.

Following the setup in [14], let us assume that we want to build a neural network
such that it takes $K \in \mathbb{N}$ functional variables and $J \in \mathbb{N}$ scalar variables as input.
In detail, suppose that we have $f_k : \tau_k \to \mathbb{R}$, $k = 1, 2, \ldots, K$ ($\tau_k$ are possibly
different intervals in $\mathbb{R}$), and $z_j^{(1)} \in \mathbb{R}$, $j = 1, 2, \ldots, J$. Furthermore, suppose that
the first layer of the network contains $n_1 \in \mathbb{N}$ neurons. We then want the i-th neuron
of the first layer to transfer the value

$$v_i^{(1)} = g\left( \sum_{k=1}^{K} \int_{\tau_k} \beta_{ik}(t) f_k(t)\,\mathrm{d}t + \sum_{j=1}^{J} w_{ij}^{(1)} z_j^{(1)} + b_i^{(1)} \right), \quad i = 1, 2, \ldots, n_1,$$

where $b_i^{(1)} \in \mathbb{R}$ is the bias and $g : \mathbb{R} \to \mathbb{R}$ is the activation function. Two
types of weights appear in the formula: the functional weights $\{\beta_{ik} : \tau_k \to \mathbb{R}\}$,
and the scalar weights $\{w_{ij}^{(1)}, b_i^{(1)}\}$. The optimal value of all these weights should
be found during the training of the network. To overcome the difficulty of finding
the optimal weight functions $\beta_{ik}$, we can express $\beta_{ik}$ as a linear combination
of $\phi_1, \ldots, \phi_{m_k}$, where $\phi_1, \ldots, \phi_{m_k}$ are the basis functions (from the Fourier
or B-spline basis) and $m_k$ is chosen by the user. The sum $\sum_{k=1}^{K} \int_{\tau_k} \beta_{ik}(t) f_k(t)\,\mathrm{d}t$ can
be expressed as $\sum_{k=1}^{K} \sum_{l=1}^{m_k} c_{ilk} \int_{\tau_k} \phi_l(t) f_k(t)\,\mathrm{d}t$, where the integrals $\int_{\tau_k} \phi_l(t) f_k(t)\,\mathrm{d}t$
can be calculated a priori and the coefficients of the linear combination of the basis
functions $\{c_{ilk}\}$ act as scalar weights of the first layer and are learned by the network.
The scalar values $v_i^{(1)}$, $i = 1, \ldots, n_1$, then propagate through the next fully connected
layers as usual. An in-depth analysis of the computational point of view is provided
in [14]. In the software R, neural networks with general input space are covered by
the package FuncNN [15] built over the packages keras [6] and tensorflow [1].
The last two packages are used to handle CNNs.
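The basis expansion described above can be sketched in a few lines: the integrals of the basis functions against an observed curve are computed once, and a first-layer neuron then only needs a weighted sum of these numbers. The sketch below is illustrative and is not the FuncNN implementation; the function names, the orthonormal scaling of the Fourier basis, the trapezoidal rule and the ReLU activation in `first_layer_neuron` are all assumptions.

```python
import math


def fourier_basis(m, a, b):
    """Return m orthonormal Fourier basis functions on (a, b); m is assumed odd."""
    length = b - a
    funcs = [lambda t: 1.0 / math.sqrt(length)]
    for j in range(1, (m - 1) // 2 + 1):
        omega = 2.0 * math.pi * j / length
        # Default arguments freeze the current frequency inside each closure.
        funcs.append(lambda t, w=omega: math.sqrt(2.0 / length) * math.sin(w * (t - a)))
        funcs.append(lambda t, w=omega: math.sqrt(2.0 / length) * math.cos(w * (t - a)))
    return funcs


def trapezoid(values, grid):
    """Trapezoidal approximation of an integral from values on a grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


def basis_integrals(f_values, grid, basis):
    """Integrals of phi_l(t) * f(t) over the grid, computed once before training."""
    return [trapezoid([phi(t) * f for t, f in zip(grid, f_values)], grid) for phi in basis]


def first_layer_neuron(integrals, coefficients, bias):
    """One first-layer neuron: ReLU of the weighted sum of precomputed integrals plus bias."""
    pre_activation = sum(c * s for c, s in zip(coefficients, integrals)) + bias
    return max(0.0, pre_activation)


if __name__ == "__main__":
    grid = [0.25 * i / 512 for i in range(513)]
    curve = [1.0 + math.exp(-(t / 0.1) ** 2) for t in grid]  # made-up functional observation
    features = basis_integrals(curve, grid, fourier_basis(29, 0.0, 0.25))
    print(len(features), first_layer_neuron(features, [0.1] * 29, bias=0.0))
```

With one functional input and no scalar inputs, the learned quantities of a neuron are just the coefficients passed as `coefficients` and the bias, which is why the functional first layer behaves like an ordinary dense layer on the precomputed integrals.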


4 Simulation Example 


This section presents a simple simulation experiment in which we illustrate the per- 
formance of the classification rule based on the neural network with general input 
space. Binary classification is considered, where the group membership indicates 
whether a point pattern was generated by a stationary Poisson point process or a sta- 
tionary Thomas process, the latter exhibiting attractive interactions among pairs 
of points [13]. The sample realizations can be seen in Figure 1. 

We consider the Thomas process to be a model with one parameter $\sigma$. Small values
of $\sigma$ indicate strong, attractive short-range interactions between points, while larger
values of $\sigma$ result in looser clusters of points. Attractive interactions between the
points of a Thomas process result in the values of the pair correlation function being
greater than the constant 1, which corresponds to the Poisson case. The effect of
$\sigma$ on the shape of the theoretical pair correlation function of the Thomas process
(which is translation invariant and isotropic) is illustrated in Figure 1.

Since the model parameter $\sigma$ affects the strength and range of attractive interactions
between points of the Thomas process, the complexity of the binary classification
task described above increases with increasing values of $\sigma$ [10, 11]. Therefore,
this experiment focuses on the situation where $\sigma$ is set to 0.1, and all realizations
are observed on the unit square $[0, 1]^2$. We fix the intensity of the two models to 400
(in spatial statistics, patterns with several hundreds of points are standard nowadays).
In this framework, we expect the classification task to be challenging enough to observe
differences in the performance of the considered classifiers. On the other hand,
it is still reasonable to distinguish (w.r.t. the chosen observation window) the realizations
of the model with attractive interactions from the realizations corresponding
to complete spatial randomness.
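For readers who want to reproduce a setting of this kind, the following sketch simulates both models on the unit square using only the Python standard library. It is not the code used for the experiment. The split of the intensity 400 into a parent intensity `kappa = 40` and a mean cluster size `mu = 10` is an assumption (the text fixes only the product and $\sigma = 0.1$), and parents are simulated on a window enlarged by `4 * sigma` to reduce edge effects.

```python
import random


def poisson_count(mean, rng):
    """Draw a Poisson(mean) variate by counting unit-rate exponential arrivals before `mean`."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def simulate_poisson(intensity, rng):
    """Stationary Poisson process observed on the unit square."""
    n = poisson_count(intensity, rng)
    return [(rng.random(), rng.random()) for _ in range(n)]


def simulate_thomas(kappa, mu, sigma, rng, margin_factor=4.0):
    """Thomas process on the unit square with intensity kappa * mu.

    Parents form a Poisson process with intensity kappa on an enlarged window;
    each parent gets a Poisson(mu) number of offspring displaced by N(0, sigma^2 I).
    Only offspring falling in the unit square are kept; parents are discarded.
    """
    margin = margin_factor * sigma
    side = 1.0 + 2.0 * margin
    points = []
    for _ in range(poisson_count(kappa * side * side, rng)):
        px = rng.uniform(-margin, 1.0 + margin)
        py = rng.uniform(-margin, 1.0 + margin)
        for _ in range(poisson_count(mu, rng)):
            x, y = rng.gauss(px, sigma), rng.gauss(py, sigma)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                points.append((x, y))
    return points


if __name__ == "__main__":
    rng = random.Random(2022)
    print(len(simulate_poisson(400.0, rng)), len(simulate_thomas(40.0, 10.0, 0.1, rng)))
```

The number of points in a Thomas pattern varies more from one realization to the next than in a Poisson pattern with the same intensity, which is one visible consequence of clustering.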

Two different collections of labelled point patterns are considered as training sets. 
The first, referred to as Training data 1, is composed of 1000 patterns per group. 
The second, called Training data 2, is then composed of 100 patterns per group. 
The test and validation sets have the same size and composition as the Training 
data 2. Table 1 presents the accuracy of three classification rules (described below) 
with respect to the test set. For the first two rules, the accuracy is in fact averaged 
over five runs corresponding to different settings of initial weights in the underlying 
neural network. Concerning the network architecture, we fix the ReLU function to be 
the activation function for all layers, except the output one. The output layer consists 
of one neuron with sigmoid activation function. The loss function is the binary 
cross-entropy. A detailed description of the individual layers is given below. 

Rule 1 is based on the neural network with general input space. We set K and
J from Sect. 3 to be 1 and 0, respectively, and $\tau_1 = (0, 0.25)$. The value 0.25 is
related to the observation window of the point patterns at hand being $[0, 1]^2$. Then,
$f_1$ is the vector of the estimated values of the pair correlation function g (estimated
by the function pcf.ppp from the package spatstat [3] with default settings but
the option divisor set to d), considered as a functional observation. Furthermore,
we set $m_1 = 29$, and consider the Fourier basis. The data preparation (estimation of g,
computation of integrals from Sect. 3) takes 740 s of elapsed time (w.r.t. the Training 
data 1, on a standard personal computer). To tune the hyperparameters of the final 
neural network (number of hidden layers, number of neurons per hidden layers, 
dropout, etc.), we performed a rough grid search (models with various combinations 
of the hyperparameters were trained on Training data 1 and we used the loss function 
and the accuracy computed on the validation set to compare the performances). 
The resulting network consists of one hidden layer with 128 neurons followed by 
a dropout layer with a rate of 0.3. We use the Adam optimizer, and the learning rate is 
decaying exponentially, with initial value 0.001 and decay parameter 0.05. In total, 
the network has 3 969 trainable parameters. To train the network, we perform 50 
epochs with an average elapsed time of 200 ms per epoch (w.r.t. Training data 1). 
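The reported number of trainable parameters can be checked with a short calculation. The sketch below assumes that the functional first layer acts as a dense layer on the 29 precomputed basis integrals (one coefficient per basis function and one bias per neuron) and that the dropout layer has no trainable parameters; under these assumptions the count agrees with the 3 969 stated above.

```python
def dense_parameters(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


n_basis = 29      # number of Fourier basis functions
n_hidden = 128    # neurons in the single hidden layer
n_output = 1      # sigmoid output neuron

total = dense_parameters(n_basis, n_hidden) + dense_parameters(n_hidden, n_output)
print(total)
```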

Table 1 Accuracy for the three presented classification rules w.r.t. the testing set. For Rule 1
and Rule 2, the accuracy is averaged over five runs corresponding to five different choices of initial
weights in the underlying neural networks. In addition, the standard deviation computed from
the five accuracy values is reported. Values close to 1 indicate a nearly perfect classification.

                   Rule 1           Rule 2           Rule 3
Training data 1    0.947 ± 0.003    0.934 ± 0.032    0.935
Training data 2    0.895 ± 0.010    0.512 ± 0.028    0.925

Rule 2 uses CNNs. Similarly to the previous case, our decision about the network
architecture is based on a rough grid search. The final network has two convolutional
layers, each of them with 8 filters, a square kernel matrix with 36 (first layer) or 16
rows (second layer), and a following average pooling layer with the pool size fixed
at 2 x 2. We add a dropout layer after the pooling, with a rate of 0.3 (after the first
pooling) and 0.2 (after the second pooling). The batch size is set to 32. We use
the Adam optimizer, and the learning rate is decaying exponentially, with initial
value 0.001 and decay parameter 0.1. The total number of trainable parameters is
equal to 32 785 and we perform 50 epochs with the average elapsed time per epoch
(w.r.t. Training data 1) equal to 930 s. Data preparation (converting point patterns
to binary images) takes less than 10 s of the elapsed time (w.r.t. Training data 1).
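The conversion of a point pattern to a binary image can be sketched as follows. The image resolution is not stated in the text, so the value `n_pixels = 100` is purely hypothetical, as is the convention that a pixel is set to 1 when it contains at least one point.

```python
def to_binary_image(points, n_pixels=100):
    """Rasterize a point pattern on the unit square into an n_pixels x n_pixels 0/1 grid."""
    image = [[0] * n_pixels for _ in range(n_pixels)]
    for x, y in points:
        # Points on the upper boundary (coordinate exactly 1.0) go to the last pixel.
        col = min(int(x * n_pixels), n_pixels - 1)
        row = min(int(y * n_pixels), n_pixels - 1)
        image[row][col] = 1
    return image


if __name__ == "__main__":
    pattern = [(0.0, 0.0), (1.0, 1.0), (0.505, 0.25)]
    image = to_binary_image(pattern)
    print(sum(map(sum, image)))
```

Two points falling in the same pixel are indistinguishable after this step, so the image keeps the configuration of points only up to the chosen resolution.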

Rule 3 is the kernel regression classifier studied in [10, 11]. We use the Epanech- 
nikov kernel together with an automatic procedure for the selection of the smoothing 
parameter. The underlying dissimilarity measure for point patterns is constructed 
as the integrated squared difference of the corresponding estimates of the pair cor- 
relation function g; for more details, see [10]. The elapsed time needed to compute 
the upper triangle of the dissimilarity matrix (containing dissimilarities between 
every pair of patterns from Training data 1) is equal to 390s. To predict the class 
membership for the testing set (w.r.t. Training data 1), 206 s elapsed. During the clas- 
sification procedure, no random initialization of any weights is needed. Thus, there 
is no reason to average the accuracy in Table 1 over multiple runs. 
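A kernel regression classifier of this type can be sketched as a Nadaraya-Watson estimate of the class probabilities, with the integrated squared difference of two estimated pair correlation functions as the dissimilarity. The sketch is illustrative only: the details of the classifier in [10, 11] may differ, the function names are hypothetical, and the bandwidth is supplied by the user instead of being selected automatically.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrated_squared_difference(g1, g2, grid):
    """Trapezoidal approximation of the integral of (g1 - g2)^2 over the grid."""
    sq = [(a - b) ** 2 for a, b in zip(g1, g2)]
    return sum(0.5 * (sq[i] + sq[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


def kernel_regression_classify(g_new, training_curves, training_labels, grid, bandwidth):
    """Return the label with the largest kernel-weighted share, or None if all weights vanish."""
    weights = [epanechnikov(integrated_squared_difference(g_new, g, grid) / bandwidth)
               for g in training_curves]
    total = sum(weights)
    if total == 0.0:
        return None  # no training pattern lies within the bandwidth
    shares = {}
    for w, label in zip(weights, training_labels):
        shares[label] = shares.get(label, 0.0) + w / total
    return max(shares, key=shares.get)


if __name__ == "__main__":
    grid = [0.25 * i / 50 for i in range(51)]
    flat = [1.0 for _ in grid]                                 # Poisson-like curve
    bump = [1.0 + 2.0 * math.exp(-(r / 0.1) ** 2) for r in grid]  # cluster-like curve
    curves = [flat, bump]
    labels = ["Poisson", "Thomas"]
    query = [v + 0.01 for v in bump]
    print(kernel_regression_classify(query, curves, labels, grid, bandwidth=0.1))
```

Because every training pattern contributes through its dissimilarity to the new pattern, prediction requires the dissimilarities between the new pattern and the whole training set, which is consistent with the computing times reported above growing with the size of the training data.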

For Training data 1, Table 1 shows that the highest accuracy was achieved for the 
neural network with general input space. The standard deviation of the five different 
accuracy values is significantly higher for CNN which has almost ten times more 
trainable parameters than the network with general input space. For Training data 
2, the kernel regression method achieved the highest accuracy. In this situation, 
the performance of the classifier is stable even in the case of small training data. 
For the first two rules, the neural network models chosen with the help of the grid 
search (where the networks were trained w.r.t. the bigger training set) are now 
trained w.r.t. the smaller training set. The resulting accuracy is still around 0.90 
for the network with general input space, but it drops to 0.5 (random assignment 
of labels) for CNN. The size of Training data 2 seems to be too small to successfully 
optimize the large number of trainable parameters of the convolutional network.

To conclude, our simulation example suggests that the classifier based on CNN 
(using information about the precise configuration of points) is in the presented sit- 
uation outperformed by the classifiers based on the estimated values of the pair cor- 
relation function (using information about the interactions between pairs of points). 
The high number of trainable parameters of the CNN makes its use rather demanding 
with respect to computational time. The approach based on neural networks with
general input space proved to be competitive with, or even to outperform, the current
benchmark method (kernel regression classifier), especially for large datasets. Also, 
it has the lowest demands regarding computational time. In the case of a small 
dataset, the low number of hyperparameters speaks in favor of kernel regression. 
Finally, in the simple classification scenario that we have presented, the choice
of the pair correlation function was adequate. In practical applications, a problem-specific
characteristic should be constructed to achieve satisfactory performance.


Parsimonious Mixtures of Seemingly Unrelated
Contaminated Normal Regression Models 


Gabriele Perrone and Gabriele Soffritti 


Abstract In recent years, the research into linear multivariate regression based on 
finite mixture models has been intense. With such an approach, it is possible to 
perform regression analysis for a multivariate response by taking account of the 
possible presence of several unknown latent homogeneous groups, each of which is 
characterised by a different linear regression model. For a continuous multivariate 
response, mixtures of normal regression models are usually employed. However, in 
real data, it is not unusual to observe mildly atypical observations that can negatively 
affect the estimation of the regression parameters under a normal distribution in 
each mixture component. Furthermore, in some fields of research, a multivariate 
regression model with a different vector of covariates for each response should be 
specified, based on some prior information to be conveyed in the analysis. To take 
account of all these aspects, mixtures of contaminated seemingly unrelated normal 
regression models have been recently developed. A further extension of such an 
approach is presented here so as to ensure parsimony, which is obtained by imposing 
constraints on the group-covariance matrices of the responses. A description of the 
resulting parsimonious mixtures of seemingly unrelated contaminated regression 
models is provided together with the results of a numerical study based on the 
analysis of a real dataset, which illustrates their practical usefulness.


1 Introduction


Seemingly unrelated (SU) regression equations are usually employed in a multivariate
regression analysis whenever the dependence of a vector $Y = (Y_1, \ldots, Y_M)'$ of
M continuous variables on a vector $X = (X_1, \ldots, X_P)'$ of P regressors has to be
modelled by allowing the error terms in the different equations to be correlated and, 
thus, the regression parameters of the M equations have to be jointly estimated [14]. 
With such an approach, the researcher is also enabled to convey prior information 
on the phenomenon under study into the specification of the regression equations 
by defining a different vector of regressors for each dependent variable. This latter 
feature is particularly useful in any situation in which different regressors are ex- 
pected to be relevant in the prediction of different responses, such as in [3, 6, 16]. 
This approach has been recently embedded into the framework of Gaussian mixture 
models, leading to multivariate SU normal regression mixtures [7]. In these models, 
the effect of the regressors on the dependent variables changes with some unknown 
latent sub-populations composing the population that has generated the sample of 
observations to be analysed. Thus, when the sample is characterised by unobserved 
heterogeneity, model-based cluster analysis is simultaneously carried out. 

Another source of complexity which could affect the data and make the prediction 
of Y a difficult task to perform is represented by mildly atypical observations [13]. 
Robust methods of parameter estimation insensitive to the presence of such obser- 
vations in a sample characterised by unobserved heterogeneity have been introduced 
in [9], where the conditional distribution Y|X = x is modelled through a mixture of 
K multivariate contaminated normal models, where K is the number of the latent 
sub-populations. A limitation associated with these latter models is that the same 
vector of regressors has to be specified for the prediction of all the dependent vari- 
ables. To overcome this limitation while preserving all the features mentioned above, 
a more flexible approach which employs mixtures of multivariate SU contaminated 
normal regression models has been recently introduced in [11]. These latter models 
are able to capture the linear effects of the regressors on the dependent variables 
from sample observations coming from heterogeneous populations. The researcher 
is also enabled to specify a different vector of regressors for each dependent variable. 
Finally, a robust estimation of the regression parameters and the detection of mild 
outliers in the data are ensured. 

In the presence of many responses and many latent sub-populations, analyses 
based on these latter models can become unfeasible in practical applications because 
of a large number of model parameters. In order to keep this number as low as 
possible, an approach due to [4], based on the spectral decompositions of the K 
covariance matrices of Y|X = x, is exploited here so as to obtain fourteen different 
covariance structures. The resulting parsimonious mixtures of SU contaminated 
regression models are described in Section 2. The usefulness of these new models is 
illustrated through a study aiming at determining the effect of prices and promotional 
activities on sales of canned tuna in the US market. A summary of the obtained results 
is provided in Section 3.


2 Parsimonious SU Contaminated Normal Regression Mixtures


In a system of M SU regression equations for modelling the linear dependence of
Y on X, let $X_m = (X_{m1}, X_{m2}, \ldots, X_{mP_m})'$ be the $P_m$-dimensional sub-vector of X
composed of the $P_m$ regressors expected to be relevant for the explanation of $Y_m$,
for $m = 1, \ldots, M$. Furthermore, let $X_m^* = (1, X_m')'$. The mixture of K SU normal
regression models described in [7] can be defined as follows:

$$Y = \begin{cases} X^{*\prime}\beta_1 + \epsilon, & \epsilon \sim N_M(0_M, \Sigma_1) \text{ with probability } \pi_1, \\ \quad \vdots & \\ X^{*\prime}\beta_K + \epsilon, & \epsilon \sim N_M(0_M, \Sigma_K) \text{ with probability } \pi_K, \end{cases} \tag{1}$$

where $\pi_k$ is the prior probability of the kth latent sub-population, with $\pi_k > 0$ for
$k = 1, \ldots, K$, and $\sum_{k=1}^{K} \pi_k = 1$; $X^*$ is the following $(P^* + M) \times M$ partitioned matrix:

$$X^* = \begin{pmatrix} X_1^* & 0_{P_1+1} & \cdots & 0_{P_1+1} \\ 0_{P_2+1} & X_2^* & \cdots & 0_{P_2+1} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{P_M+1} & 0_{P_M+1} & \cdots & X_M^* \end{pmatrix},$$

with $0_{P_m+1}$ denoting the $(P_m + 1)$-dimensional null vector; $P^* = \sum_{m=1}^{M} P_m$;
$\beta_k = (\beta_{k1}', \ldots, \beta_{km}', \ldots, \beta_{kM}')'$ is the $(P^* + M)$-dimensional vector containing all the
linear effects on the M responses in the kth latent sub-population, with
$\beta_{km} = (\beta_{0k,m}, \beta_{k,m}')'$, for $m = 1, \ldots, M$; $\epsilon = (\epsilon_1, \ldots, \epsilon_M)'$ is the vector of the errors, which
are supposed to be independent and identically distributed; $N_M(0_M, \Sigma_k)$ denotes
the M-dimensional normal distribution with mean vector $0_M$ and positive-definite
covariance matrix $\Sigma_k$. From now on, this mixture regression model is denoted as
MSUN. When $X_m = X$ $\forall m$ (the P regressors are employed in all the M equations),
model (1) reduces to the mixtures of K normal (MN) regression models (see [8]).

When the data are contaminated by the presence of mild outliers, departures from
the normal distribution could be observed within any of the K latent sub-populations.
A model able to manage this situation has been recently introduced in [11]. It
has been obtained from equation (1) by replacing the normal distribution with
the contaminated normal distribution. Under this latter distribution, the probability
density function (p.d.f.) of $\epsilon$ within the kth sub-population is equal to
$h(\epsilon; \vartheta_k) = \alpha_k \phi_M(\epsilon; 0_M, \Sigma_k) + (1 - \alpha_k) \phi_M(\epsilon; 0_M, \eta_k \Sigma_k)$, where $\phi_M(\cdot; \mu, \Sigma)$ denotes the
p.d.f. of the distribution $N_M(\mu, \Sigma)$, $\alpha_k \in (0.5, 1)$ and $\eta_k > 1$ are the proportion
of typical observations within the kth sub-population and a parameter that inflates the
elements of $\Sigma_k$, respectively, and $\vartheta_k = (\alpha_k, \eta_k, \Sigma_k)$. As a consequence, a mixture
of K SU contaminated normal (MSUCN) regression models is given by:

$$Y = \begin{cases} X^{*\prime}\beta_1 + \epsilon, & \epsilon \sim CN_M(\alpha_1, \eta_1, 0_M, \Sigma_1) \text{ with probability } \pi_1, \\ \quad \vdots & \\ X^{*\prime}\beta_K + \epsilon, & \epsilon \sim CN_M(\alpha_K, \eta_K, 0_M, \Sigma_K) \text{ with probability } \pi_K, \end{cases} \tag{2}$$

where $CN_M(\alpha_k, \eta_k, 0_M, \Sigma_k)$ denotes the M-dimensional contaminated normal
distribution described by the p.d.f. $h(\epsilon; \vartheta_k)$. The parameter vector of model (2) is
$\psi = (\psi_1, \ldots, \psi_k, \ldots, \psi_K)$, where $\psi_k = (\pi_k, \theta_k)$, $\theta_k = (\beta_k, \vartheta_k)$. The number of
free elements of $\psi$ is $n_\psi = 3K - 1 + K(P^* + M) + n_\sigma$, where $n_\sigma$ denotes the total
number of free variances and covariances, with $n_\sigma = K n_\Sigma$ and $n_\Sigma = M(M+1)/2$. When
$X_m = X$ $\forall m$, model (2) coincides with the mixture of K contaminated normal (MCN)
regression models described in [9]. For $\alpha_k \to 1$ or $\eta_k \to 1$ $\forall k$, model (2) reduces
to model (1). Conditions ensuring identifiability of models (2) are provided in [11].
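The contaminated normal density is a two-component mixture of normal densities sharing the same mean, where the second component has its covariance matrix inflated by a factor greater than one. A minimal sketch for the bivariate case (M = 2, as in the application of Section 3) is given below; restricting to M = 2 is an assumption made to keep the linear algebra explicit, and the function names and numerical values are hypothetical.

```python
import math


def bivariate_normal_pdf(e, cov):
    """Density of a zero-mean bivariate normal with covariance [[a, b], [b, c]] at e."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    quad = (c * e[0] ** 2 - 2.0 * b * e[0] * e[1] + a * e[1] ** 2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def contaminated_normal_pdf(e, alpha, eta, cov):
    """alpha * N(0, cov) + (1 - alpha) * N(0, eta * cov), with eta > 1 inflating the covariance."""
    inflated = [[eta * v for v in row] for row in cov]
    return (alpha * bivariate_normal_pdf(e, cov)
            + (1.0 - alpha) * bivariate_normal_pdf(e, inflated))


if __name__ == "__main__":
    cov = [[1.0, 0.3], [0.3, 2.0]]
    far_point = (4.0, 4.0)
    print(bivariate_normal_pdf(far_point, cov),
          contaminated_normal_pdf(far_point, alpha=0.9, eta=10.0, cov=cov))
```

Far from the mean the contaminated density is much larger than the normal one, which is how the model accommodates mild outliers without distorting the estimate of the covariance matrix of the typical observations.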

The ML estimation of $\psi$ in equation (2) can be carried out by means of a sample
$S = \{(x_1, y_1), \ldots, (x_I, y_I)\}$ of independent observations drawn from model (2) and
an expectation-conditional maximisation (ECM) algorithm [10]. Details about this
algorithm, including strategies for the initialisation of $\psi$ and convergence criteria, are
illustrated in [11]. In practical applications, the value of K is generally unknown and
has to be properly chosen. This task can be carried out by resorting to model selection
criteria, such as the Bayesian information criterion [15]: $BIC = 2\ell(\hat{\psi}) - n_\psi \ln I$,
where $\hat{\psi}$ is the maximum likelihood estimator of $\psi$. Another commonly used
information criterion is the integrated completed likelihood [2], which admits two
slightly different formulations: $ICL_1 = BIC + 2 \sum_{i=1}^{I} \sum_{k=1}^{K} MAP(\hat{z}_{ik}) \ln \hat{z}_{ik}$ and
$ICL_2 = BIC + 2 \sum_{i=1}^{I} \sum_{k=1}^{K} \hat{z}_{ik} \ln \hat{z}_{ik}$, where $\hat{z}_{ik}$ is the estimated posterior probability
that the ith sample observation comes from the kth sub-population (for further
details see [11]), and $MAP(\hat{z}_{ik}) = 1$ if $\max_h\{\hat{z}_{ih}\}$ occurs when $h = k$ ($MAP(\hat{z}_{ik}) = 0$
otherwise). Whenever the specification of the subvectors $X_m$, $m = 1, \ldots, M$, to be
considered in the M equations of the multivariate regression model is questionable,
such criteria can also be employed to perform subset selection.
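The model selection criteria above are simple functions of the maximised log-likelihood, the number of free parameters and the matrix of estimated posterior probabilities. The following sketch shows how they could be computed once a model has been fitted; the function names and the numerical inputs are hypothetical, and the fitting step itself (the ECM algorithm) is not reproduced.

```python
import math


def n_free_parameters(K, P_star, M, n_sigma):
    """Number of free parameters: 3K - 1 + K(P* + M) + n_sigma."""
    return 3 * K - 1 + K * (P_star + M) + n_sigma


def bic(loglik, n_psi, sample_size):
    """BIC in the 'larger is better' form 2 * loglik - n_psi * ln(sample size)."""
    return 2.0 * loglik - n_psi * math.log(sample_size)


def icl(bic_value, posteriors, hard=True):
    """ICL_1 (hard=True, MAP assignments) or ICL_2 (hard=False, soft assignments)."""
    penalty = 0.0
    for row in posteriors:
        best = max(range(len(row)), key=row.__getitem__)
        for k, z in enumerate(row):
            if z <= 0.0:
                continue  # the term z * ln z is taken as 0 when z = 0
            weight = (1.0 if k == best else 0.0) if hard else z
            penalty += weight * math.log(z)
    return bic_value + 2.0 * penalty


if __name__ == "__main__":
    posteriors = [[0.9, 0.1], [0.6, 0.4]]          # made-up posterior probabilities
    n_psi = n_free_parameters(K=2, P_star=5, M=2, n_sigma=4)
    b = bic(loglik=-100.0, n_psi=n_psi, sample_size=338)
    print(n_psi, b, icl(b, posteriors, hard=True), icl(b, posteriors, hard=False))
```

Both ICL variants subtract a non-negative quantity from the BIC, so they penalise fitted models whose posterior probabilities assign observations to clusters with little certainty.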

As the number of free parameters $n_\psi$ increases quadratically with M, analyses
based on model (2) can become unfeasible in real applications. A way to manage
this problem can be based on the introduction of suitable constraints on the
elements of $\Sigma_k$, $k = 1, \ldots, K$, based on the following eigen-decomposition [4]:
$\Sigma_k = \lambda_k D_k A_k D_k'$, where $\lambda_k = |\Sigma_k|^{1/M}$, $A_k$ is a diagonal matrix with entries
(sorted in decreasing order) proportional to the eigenvalues of $\Sigma_k$ (with the constraint
$|A_k| = 1$) and $D_k$ is an $M \times M$ orthogonal matrix of the eigenvectors of $\Sigma_k$
(ordered according to the eigenvalues). This decomposition makes it possible to obtain
the variances and covariances in $\Sigma_k$ from $\lambda_k$, $A_k$ and $D_k$. From a geometrical point of view,
$\lambda_k$ determines the volume, $A_k$ the shape and $D_k$ the orientation of the kth cluster of
sample observations detected by the fitted model. By constraining $\lambda_k$, $A_k$ and $D_k$ to
be equal or variable across the K clusters, a class of fourteen mixtures of K SUCN
regression models is obtained (see Table 1). With variable volumes, shapes and orientations
(VVV in Table 1), the resulting model coincides with (2). When $K > 1$, the
other covariance structures yield thirteen different parsimonious mixtures
of K SUCN regression models (i.e., with a reduced $n_\sigma$). When $K = 1$, the possible
covariance structures for $\Sigma_1$ are: diagonal with different entries, diagonal with the
same entries and fully unconstrained. The ML estimation of $\psi$ under model (2) with
any of these parameterisations can be carried out through an ECM algorithm in
which the CM-step update for $\Sigma_k$ can be computed either in closed form or using
iterative procedures, depending on the parameterisation to be employed (see [4]).
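For a single covariance matrix, the volume, shape and orientation can be obtained directly from its eigenvalues and eigenvectors. The sketch below does this in closed form for the bivariate case (M = 2, an assumption made to avoid external linear-algebra libraries) and rebuilds the matrix from the three components; the function names are hypothetical.

```python
import math


def volume_shape_orientation(cov):
    """Split a 2 x 2 covariance matrix [[a, b], [b, c]] into (lambda, diag(A), D)."""
    (a, b), (_, c) = cov
    mean, radius = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    eig_large, eig_small = mean + radius, mean - radius
    volume = math.sqrt(eig_large * eig_small)          # |Sigma|^(1/M) with M = 2
    shape = (eig_large / volume, eig_small / volume)   # decreasing order, product equal to 1
    theta = 0.5 * math.atan2(2.0 * b, a - c)           # angle of the leading eigenvector
    orientation = [[math.cos(theta), -math.sin(theta)],
                   [math.sin(theta), math.cos(theta)]]
    return volume, shape, orientation


def rebuild(volume, shape, orientation):
    """Recompose Sigma = lambda * D * A * D' from its three components."""
    (d11, d12), (d21, d22) = orientation
    l1, l2 = volume * shape[0], volume * shape[1]
    off_diagonal = l1 * d11 * d21 + l2 * d12 * d22
    return [[l1 * d11 * d11 + l2 * d12 * d12, off_diagonal],
            [off_diagonal, l1 * d21 * d21 + l2 * d22 * d22]]


if __name__ == "__main__":
    sigma = [[2.0, 0.6], [0.6, 1.0]]
    lam, A, D = volume_shape_orientation(sigma)
    print(lam, A, rebuild(lam, A, D))
```

Constraining one of the three returned components to be common to all clusters, while letting the others vary, is what generates the different covariance structures.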
Table 1 Features of the parameterisations for the covariance matrices $\Sigma_k$, $k = 1, \ldots, K$ ($K > 1$).

Acronym  Covariance structure     Volume    Shape      Orientation   CM step    $n_\sigma$
EEE      $\lambda D A D'$         Equal     Equal      Equal         Closed     $n_\Sigma$
VVV      $\lambda_k D_k A_k D_k'$ Variable  Variable   Variable      Closed     $K n_\Sigma$
EII      $\lambda I$              Equal     Spherical  -             Closed     $1$
VII      $\lambda_k I$            Variable  Spherical  -             Closed     $K$
EEI      $\lambda A$              Equal     Equal      Axis-aligned  Closed     $M$
VEI      $\lambda_k A$            Variable  Equal      Axis-aligned  Iterative  $M + K - 1$
EVI      $\lambda A_k$            Equal     Variable   Axis-aligned  Closed     $MK - (K - 1)$
VVI      $\lambda_k A_k$          Variable  Variable   Axis-aligned  Closed     $MK$
EEV      $\lambda D_k A D_k'$     Equal     Equal      Variable      Iterative  $K n_\Sigma - (K - 1)M$
VEV      $\lambda_k D_k A D_k'$   Variable  Equal      Variable      Iterative  $K n_\Sigma - (K - 1)(M - 1)$
EVE      $\lambda D A_k D'$       Equal     Variable   Equal         Iterative  $n_\Sigma + (K - 1)(M - 1)$
VVE      $\lambda_k D A_k D'$     Variable  Variable   Equal         Iterative  $n_\Sigma + (K - 1)M$
VEE      $\lambda_k D A D'$       Variable  Equal      Equal         Iterative  $n_\Sigma + (K - 1)$
EVV      $\lambda D_k A_k D_k'$   Equal     Variable   Variable      Iterative  $K n_\Sigma - (K - 1)$
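The number of free variances and covariances of each parameterisation can be cross-checked by counting the volume, shape and orientation parameters separately. In the sketch below, the closed-form expressions take the counts for EVE, VVE and VEE as $n_\Sigma$ plus the extra terms, since that is what the direct count gives; the function names are hypothetical.

```python
def covariance_parameter_counts(K, M):
    """Closed-form number of free covariance parameters for the fourteen structures."""
    n_Sigma = M * (M + 1) // 2
    return {
        "EII": 1, "VII": K, "EEI": M, "VEI": M + K - 1,
        "EVI": M * K - (K - 1), "VVI": M * K,
        "EEE": n_Sigma, "VVV": K * n_Sigma,
        "EEV": K * n_Sigma - (K - 1) * M,
        "VEV": K * n_Sigma - (K - 1) * (M - 1),
        "EVE": n_Sigma + (K - 1) * (M - 1),
        "VVE": n_Sigma + (K - 1) * M,
        "VEE": n_Sigma + (K - 1),
        "EVV": K * n_Sigma - (K - 1),
    }


def direct_count(acronym, K, M):
    """Count volume, shape and orientation parameters one component at a time."""
    volume = 1 if acronym[0] == "E" else K
    if acronym[1] == "I":          # spherical: only the volume is free
        return volume
    shape = (M - 1) * (1 if acronym[1] == "E" else K)
    if acronym[2] == "I":          # axis-aligned: no orientation parameters
        return volume + shape
    orientation = (M * (M - 1) // 2) * (1 if acronym[2] == "E" else K)
    return volume + shape + orientation


if __name__ == "__main__":
    for name, value in covariance_parameter_counts(K=2, M=2).items():
        print(name, value, direct_count(name, 2, 2))
```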


3 Analysis of U.S. Canned Tuna Sales 


The models illustrated in Section 2 have been fitted to a dataset [5] containing the
volume of sales (Move), a measure of the display activity (Nsale) and the log price
(Lprice) for seven of the top 10 U.S. brands in the canned tuna product category in
the $I = 338$ weeks between September 1989 and May 1997. The goal of the analysis
is to study the dependence of canned tuna sales on prices and promotional activities
for two products: Star Kist 6 oz. (SK) and Bumble Bee Solid 6.12 oz. (BBS). To this
end, the following vectors have been considered: $Y' = (Y_1 = \text{Lmove SK}, Y_2 = \text{Lmove BBS})$,
$X' = (X_1 = \text{Nsale SK}, X_2 = \text{Lprice SK}, X_3 = \text{Nsale BBS}, X_4 = \text{Lprice BBS})$,
where Lmove denotes the logarithm of Move. The analysis has been carried
out using all the parameterisations of the MSUN, MN, MCSUN and MCN models
for each $K \in \{1, 2, 3, 4, 5, 6\}$. Furthermore, MSUN and MCSUN models have been
fitted by considering all possible subvectors of X as vectors $X_m$, $m = 1, 2$, for each
K. In this way, best subset selections for Lmove SK and Lmove BBS have been
included in the analysis both with and without contamination. The overall number of
fitted models is 37376, including the fully unconstrained models (i.e., with the VVV
parameterisation) previously employed in [11] to perform the same analysis.
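The overall number of fitted models can be recovered with a short count. The sketch assumes that each of the two regressor vectors ranges over all 16 subsets of the four regressors (including the empty one), that three covariance structures are available when K = 1 and fourteen otherwise, and that the MN and MCN models are counted as the special case in which both equations use all regressors. These assumptions are not spelled out in the text, but they reproduce the reported total.

```python
from itertools import combinations

regressors = ["Nsale SK", "Lprice SK", "Nsale BBS", "Lprice BBS"]

# All subsets of the regressors, from the empty set to the full vector.
subsets = [c for size in range(len(regressors) + 1) for c in combinations(regressors, size)]

# Covariance structures per number of components: 3 when K = 1, 14 when K > 1.
structures_per_K = {K: 3 if K == 1 else 14 for K in range(1, 7)}
n_structures = sum(structures_per_K.values())

# Two model families (with and without contamination), one subset per equation.
n_models = 2 * n_structures * len(subsets) ** 2
print(len(subsets), n_structures, n_models)
```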

Table 2 reports some information about the nine models which best fit the analysed 
dataset according to the three model selection criteria over the six examined values 
of K within each model class. An analysis based on a single linear regression model 
(K = 1), both with and without contamination, appears to be inadequate according to 
all criteria. All the examined criteria indicate that the overall best model for studying 
the effect of prices and promotional activities on sales of SK and BBS tuna is a 
parsimonious mixture of two SU contaminated Gaussian linear regression models 
with the EVE parameterisation for the covariance matrices in which the log unit sales 
of SK tuna are regressed on the log prices and the promotional activites of the same 
brand, while the regressors selected for the BBS log unit sales are the log prices of 
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both brands and the promotional activites of BBS. Thus, the analysis suggests that 
two sources of complexity affect the analysed dataset: unobserved heterogeneity over 
time (K = 2 clusters of weeks have been detected) and the presence of mildly atypical 
observations. Since the two estimated proportions of typical observations are quite 
similar (see the values of $\hat{\alpha}_k$ in Table 3), contamination seems to characterise the
two clusters of weeks detected by the model almost in the same way. As far as the 
strength of the contaminating effects on the conditional variances and covariances 
of Y|X = x is concerned, it appears to be stronger in the first cluster, where the 
estimated inflation parameter is larger ($\hat{\eta}_1 = 15.70$). Turning to the
other estimates, some of the estimated regression coefficients,
variances and covariances also appear to be affected by heterogeneity over time. Sales of SK tuna
are negatively affected by prices and positively affected by promotional
activities of the same brand within both clusters detected by the model, but with
effects which are slightly stronger in the first cluster of weeks. A similar behavior is
detected for the estimated regression equation for Lmove BBS, which also highlights
that Lmove BBS is positively affected by the log prices of SK tuna, especially in
the first cluster of weeks. Furthermore, typical weeks in the first cluster show values
of Lmove SK which are more homogeneous than those of Lmove BBS; the opposite
holds true for the typical weeks belonging to the second cluster. The correlation
between log sales of SK and BBS products also appears to be affected by heterogeneity
over time: while in the largest cluster of weeks this correlation has been estimated
to be slightly positive (0.200), the first cluster is characterised by a mild estimated
negative correlation (-0.151). An interesting feature of this latter cluster is that 17
out of the 20 weeks which have been assigned to this cluster are consecutive from 
week no. 58 to week no. 74, which correspond to the period from mid-October 1990 
to mid-February 1991 characterised by a worldwide boycott campaign encouraging 
consumers not to buy Bumble Bee tuna because Bumble Bee was found to be buying 
yellow-fin tuna caught by dolphin-unsafe techniques [1]. Such events could represent 
one of the sources of the unobserved heterogeneity detected by the model. According 
to the overall best model, some weeks have been detected as mild outliers. In the
first cluster, this has happened for week no. 60 (immediately after Halloween 1990)
and week no. 73 (two weeks immediately before Presidents' Day 1991). The analysis
of the estimated sample residuals $y_i - \hat{\mu}_1(x_i; \hat{\beta}_1)$ for the 20 weeks belonging to the
first cluster (see the scatterplot on the left side of Figure 1) clearly shows that weeks
60 and 73 noticeably deviate from the other weeks. Among the 318 weeks of the
second cluster, 32 turned out to be mild outliers, most of which are associated
with holidays and special events that took place between September 1989 and mid- 
October 1990 or between mid-February and May 1997 (see the scatterplot on the 
right side of Figure 1). These results are almost equal to those obtained using the best 
overall fully unconstrained fitted model in the analysis presented in [11]. However, 
the EVE parameterisation for the MSUCN model has made it possible to obtain a better
trade-off among the fit, the model complexity and the uncertainty of the estimated partition
of the weeks; furthermore, it has led to a slightly lower number of mild outliers in 
the second cluster of weeks. 
Table 2 Maximised log-likelihood $\ell(\hat{\psi})$, number of parameters $n_\psi$ and values of BIC, ICL$_1$ and ICL$_2$ for nine models
selected from the classes MSUCN, MCN, MSUN and MN in the analysis of tuna sales.

Model class  K  Acronym  $X_1$            $X_2$            $\ell(\hat{\psi})$  $n_\psi$  BIC     ICL$_1$  ICL$_2$
MSUCN        2  EVE      $X_1, X_2$       $X_2, X_3, X_4$  -242.9              23        -619.8  -625.7   -635.8
MCN          2  EVI      $X$              $X$              -239.6              28        -642.2  -648.9   -663.2
MCN          2  EEV      $X$              $X$              -240.8              29        -650.6  -650.8   -652.0
MCN          3  EVI      $X_1, X_2, X_4$  $X_1, X_2, X_4$  -214.2              36        -638.0  -703.1   -788.6
MSUN         2  VEV      $X_1, X_2$       $X_3, X_4$       -279.3              18        -663.4  -673.1   -692.1
MSUN         3  EEV      $X_2, X_3$       $X_2, X_3, X_4$  -259.8              28        -682.7  -684.7   -688.0
MSUN         5  VVV      $X_2, X_3$       $X_1, X_4$       -167.4              49        -620.0  -701.1   -780.3
MN           3  EEV      $X_2, X_3, X_4$  $X_2, X_3, X_4$  -258.7              31        -697.9  -699.6   -702.1
MN           4  VVE      $X_2, X_4$       $X_2, X_4$       -216.6              36        -642.9  -725.3   -832.9
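
As a quick consistency check on Table 2, the reported BIC values can be reproduced from the maximised log-likelihood and the number of free parameters. The sketch below assumes the "larger is better" form BIC = 2 x log-likelihood - (number of parameters) x log(n), with n = 338 weeks (the 20 + 318 weeks of the two clusters described above). Both the form of the criterion and the sample size are assumptions inferred from the reported figures; they are not stated in the table.

```python
import math

def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' form: 2*loglik - n_params*log(n_obs)."""
    return 2.0 * loglik - n_params * math.log(n_obs)

N_WEEKS = 338  # assumption: 20 weeks in the first cluster + 318 in the second

# (model, maximised log-likelihood, number of parameters, reported BIC) from Table 2
rows = [
    ("MSUCN K=2 EVE", -242.9, 23, -619.8),
    ("MCN   K=2 EVI", -239.6, 28, -642.2),
    ("MSUN  K=5 VVV", -167.4, 49, -620.0),
]

for name, loglik, n_params, reported in rows:
    computed = bic(loglik, n_params, N_WEEKS)
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```

Under these assumptions the computed values agree with the reported ones up to the rounding of the log-likelihood, which also shows how narrowly the MSUCN model with K = 2 precedes the MSUN model with K = 5 in terms of BIC.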


Table 3 Parameter estimates of the overall best model for the analysis of tuna sales.

Parameter            k = 1                           k = 2
$\hat{\pi}_k$        0.062                           0.938
$\hat{\alpha}_k$     0.810                           0.844
$\hat{\eta}_k$       15.70                           6.94
$\hat{\beta}_{k1}$   (8.87, 0.56, -4.70)             (8.64, 0.27, -3.09)
$\hat{\beta}_{k2}$   (15.04, 3.92, 2.83, 17.76)      (9.98, 0.25, 0.12, 3.83)
$\hat{\Sigma}_k$     [0.034, -0.009; -0.009, 0.105]  […, 0.012; 0.012, 0.030]
Fig. 1 Scatterplots of the estimated residuals for the weeks assigned to the first (left) and second
(right) clusters detected by the overall best model. Points of the first scatterplot are labelled with
the number of the corresponding weeks. Black circles and red triangles in the second scatterplot
correspond to typical and outlying weeks, respectively.


4 Conclusions 


The parsimonious mixtures of seemingly unrelated linear regression models for 
contaminated data introduced here can account for heterogeneous regression data 
both in the presence of mild outliers and multivariate correlated dependent variables,
each of which is regressed on a different vector of covariates. Models from this 
class allow for simultaneous robust clustering and detection of mild outliers in 
multivariate regression analysis. They encompass several other types of Gaussian 
mixture-based linear regression models previously proposed in the literature, such 
as the ones illustrated in [7, 8, 9], providing a robust and flexible tool for modelling 
data in practical applications where different regressors are considered to be relevant 
for the prediction of different dependent variables. Previous research (see [9, 11]) 
demonstrated that BIC and ICL could be effectively employed to select a proper 
value for K in the presence of mildly contaminated data. Thanks to the imposition of
an eigen-decomposed structure on the K variance-covariance matrices of Y|X = x, 
the presented models are characterised by a reduced number of variance-covariance 
parameters to be included in the analysis, thus improving flexibility, usefulness and 
effectiveness of an approach to multivariate linear regression analysis based on finite 
Gaussian mixture models in real data applications. 
Penalized Model-based Functional Clustering: a
Regularization Approach via Shrinkage Methods 


Nicola Pronello, Rosaria Ignaccolo, Luigi Ippoliti, and Sara Fontanella 


Abstract With the advance of modern technology, and with data being recorded 
continuously, functional data analysis has gained a lot of popularity in recent years. 
Working in a mixture model-based framework, we develop a flexible functional 
clustering technique that achieves dimensionality reduction through an $L_1$
penalization. The proposed procedure results in an integrated modelling approach
where shrinkage techniques are applied to enable sparse solutions in both the means 
and the covariance matrices of the mixture components, while preserving the under- 
lying clustering structure. This leads to an entirely data-driven methodology suitable 
for simultaneous dimensionality reduction and clustering. Preliminary experimental 
results, both from simulation and real data, show that the proposed methodology is 
worth considering within the framework of functional clustering. 
1 Introduction


In recent decades, technological innovations have produced data that are increasingly 
complex, high dimensional, and structured. A large amount of these data can be 
characterized as functions defined on some continuous domain and their statistical 
analysis has attracted the interest of many researchers. This surge of interest is
explained by the ubiquitous examples of functional data that can be found in different 
application fields (see for example [2], and references therein for specific examples). 
With functions as the basic units of observation, the analysis of functional data 
poses significant theoretical and practical challenges to statisticians. Despite these 
difficulties, methodology for clustering functional data has advanced rapidly during 
the past years; recent surveys of functional data clustering are presented in [7] and 
[2]. Popular approaches have extended classical clustering concepts for vector-valued 
multivariate data to functional data. 

In this paper, we consider a finite mixture as a flexible model for clustering. 
In particular, applying a functional model-based clustering algorithm with an $L_1$-penalty
function on a set of projection coefficients, we extend the results of [8]
and [9] for vector-valued multivariate data to a functional data framework. This
approach appears particularly appealing in all cases in which the functions are
spatially heterogeneous, meaning that some parts of the function can be smoother
than other parts, or that there may be distant parts of the function that are correlated
with each other. Furthermore, the introduction of a shrinkage penalty makes it possible to look
for directions in the feature space (which is now the space of expansion/projection
coefficients) that are the most useful in separating the underlying groups without
first applying dimensionality reduction techniques.

In Section 2 we first present the methodology along with some details on model
estimation (Subsection 2.2). Then, in Section 3, we perform a validation study
with simulated and real data for which the classes are known a priori.


2 Shrinkage Method for Model-based Clustering for Functional 
Data 


Here we consider the problem of clustering a set of n observed curves into K 
homogeneous groups (or clusters). To this end, we propose a flexible model based 
on a finite mixture of Gaussian distributions, with an $L_1$ penalized likelihood, which
we name Penalized model-based Functional Clustering (PFC-$L_1$).


2.1 Model Definition 


We consider a set of $n$ observed curves, $x_1, \ldots, x_n$, that are independent realizations
of a continuous stochastic process $X = \{X(t)\}_{t \in [0,T]}$ taking values in $L^2[0, T]$. In
practice, such curves/trajectories are available only at a discrete set of domain
points $\{t_{is} : i = 1, \ldots, n,\ s = 1, \ldots, m_i\}$ and the $n$ curves need to be reconstructed.
To this goal, it is common to assume that the curves belong to a finite dimensional
space spanned by a basis of functions, so that given a basis of functions $\Psi =
\{\psi_1, \ldots, \psi_p\}$ each curve $x_i(t)$ admits the following decomposition:

$$x_i(t) = \sum_{j=1}^{p} \beta_{i,j}\,\psi_j(t), \qquad i = 1, \ldots, n, \tag{2.1}$$

that is, the stochastic process $X$ admits a corresponding truncated basis expansion

$$X(t) = \sum_{j=1}^{p} \beta_j(X)\,\psi_j(t),$$

where $\beta = \{\beta_1(X), \ldots, \beta_p(X)\}$ is a random vector in $\mathbb{R}^p$. By considering
observations with a sampling error, such that

$$x_i^{obs}(t_{is}) = x_i(t_{is}) + \epsilon_{is}, \qquad s = 1, \ldots, m_i, \tag{2.2}$$

with $\epsilon_{is} \sim N(0, \sigma^2)$, the realizations of the random coefficients $\beta_{i,j}$ for $j = 1, \ldots, p$
describing each curve can be obtained via least squares as $\hat{\beta}_i = (\Theta_i'\Theta_i)^{-1}\Theta_i' X_i^{obs}$,
where $\Theta_i = (\psi_j(t_{is}))$, $1 \le j \le p$, $1 \le s \le m_i$, contains the basis functions evaluated
at the fixed domain points and $X_i^{obs} = (x_i^{obs}(t_{i1}), \ldots, x_i^{obs}(t_{im_i}))'$ is the vector of
observed values of the i-th curve.
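
The least-squares projection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Fourier basis layout (a constant followed by sine/cosine pairs), the function names and the simulated curve are all assumptions made for the example, and the basis matrix is stored with one row per domain point.

```python
import numpy as np

def fourier_basis(t, p, period):
    """Evaluate p Fourier basis functions (constant, then sin/cos pairs) at the points t."""
    cols = [np.ones_like(t)]
    h = 1
    while len(cols) < p:
        cols.append(np.sin(2 * np.pi * h * t / period))
        if len(cols) < p:
            cols.append(np.cos(2 * np.pi * h * t / period))
        h += 1
    return np.column_stack(cols)  # shape (m_i, p): basis functions at the domain points

def projection_coefficients(theta, x_obs):
    """Least-squares estimate of the expansion coefficients of one observed curve."""
    coef, *_ = np.linalg.lstsq(theta, x_obs, rcond=None)
    return coef

# Hypothetical curve: known coefficients plus Gaussian sampling error.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100, endpoint=False)
theta = fourier_basis(t, p=5, period=1.0)
beta_true = np.array([0.5, 2.0, 0.0, 0.0, -1.0])
x_obs = theta @ beta_true + rng.normal(scale=0.1, size=t.size)

beta_hat = projection_coefficients(theta, x_obs)
print(np.round(beta_hat, 2))
```

Using `np.linalg.lstsq` rather than forming the inverse of the cross-product matrix explicitly gives the same estimate when the basis matrix has full column rank, and is numerically more stable.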

With the goal of dividing the observed curves $x_1, \ldots, x_n$ into $K$ homogeneous groups,
let us assume that there exists an unobservable grouping variable $Z =
(Z_1, \ldots, Z_K) \in \{0, 1\}^K$ indicating the cluster membership: $z_{i,k} = 1$ if $x_i$ belongs to
cluster $k$, 0 otherwise (and $z_{i,k}$ is indeed what we want to predict for each curve).

In adopting a model-based clustering approach, we denote with $\pi_k$ the (a priori)
probabilities of belonging to a group:

$$\pi_k = P(Z_k = 1), \qquad k = 1, \ldots, K,$$

such that $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k > 0$ for each $k$, and we assume that, conditionally on
$Z$, the random vector $\beta$ follows a multivariate Gaussian distribution, that is for each
cluster

$$\beta \mid (Z_k = 1) = \beta_k \sim N(\mu_k, \Sigma_k),$$

where $\mu_k = (\mu_{k,1}, \ldots, \mu_{k,p})'$ and $\Sigma_k$ are respectively the mean vector and
the covariance matrix of the k-th group. Then the marginal distribution of $\beta =
(\beta_1, \ldots, \beta_p)$ can be written as a finite mixture with mixing proportions $\pi_k$ as

$$p(\beta) = \sum_{k=1}^{K} \pi_k f(\beta; \mu_k, \Sigma_k),$$
where $f$ is the multivariate Gaussian density function. The log-likelihood function
can then be written as

$$\ell(\theta; \beta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(\beta_i; \mu_k, \Sigma_k),$$

where $\theta = (\pi_1, \ldots, \pi_K;\ \mu_1, \ldots, \mu_K;\ \Sigma_1, \ldots, \Sigma_K)$ is the vector of parameters to be
estimated and $\beta_i = (\beta_{i,1}, \ldots, \beta_{i,p})'$ is the vector of projection coefficients of the
i-th curve.


In this modeling framework, we consider a very general situation without introducing
any kind of constraint on either the cluster means or the covariance matrices,
which can be different in each cluster. This flexibility, however, leads to
overparameterization and, as an alternative to constraints, we consider a penalty that
allows regularized parameter estimation.

To define a suitable penalty term, we follow the penalized approach introduced
by Zhou et al. [8] in the high-dimensional setting, and so we consider a penalty
composed of two terms: the first one on the mean vector of each cluster, $\mu_k$, and
the second one on the inverse of the covariance matrix in each group, $W_k = \Sigma_k^{-1}$,
also called the “precision” matrix, with elements $W_{k;j,l}$. The proposed penalized
log-likelihood function, given the projection coefficients $\beta_i$, is

$$\ell_P(\theta; \beta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(\beta_i; \mu_k, \Sigma_k) - \lambda_1 \sum_{k=1}^{K} \|\mu_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l=1}^{p} |W_{k;j,l}|,$$

where $\|\mu_k\|_1 = \sum_{j=1}^{p} |\mu_{k,j}|$, and $\lambda_1 > 0$ and $\lambda_2 > 0$ are penalty parameters to be suitably
chosen.

The penalty term on the cluster mean vectors allows for component selection
in the functional data framework (whereas it would be variable selection in the
multivariate case), considering that when the j-th component in the basis expansion
is not useful in separating groups it has a common mean across groups, that is
$\mu_{1,j} = \cdots = \mu_{K,j} = 0$. Then to realize component selection the considered term is
$\sum_{k=1}^{K} \|\mu_k\|_1$.

The second part of the penalty, namely $\sum_{k=1}^{K} \sum_{j,l=1}^{p} |W_{k;j,l}|$, imposes a shrinkage on
the elements of the precision matrices, thus avoiding possible singularity problems
and facilitating the estimation of large and sparse covariance matrices.
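
To make the two penalty terms concrete, the following sketch evaluates them for given cluster means and precision matrices, and lists the basis components whose mean is zero in every cluster (the components that the first penalty term would discard). The function names and the toy values are hypothetical; component indices are 0-based.

```python
def l1_penalty(mus, precisions, lam1, lam2):
    """lam1 * sum_k ||mu_k||_1 + lam2 * sum_k sum_{j,l} |W_{k;j,l}|."""
    mean_term = sum(abs(m) for mu in mus for m in mu)
    prec_term = sum(abs(w) for W in precisions for row in W for w in row)
    return lam1 * mean_term + lam2 * prec_term

def uninformative_components(mus, tol=0.0):
    """Indices j such that mu_{k,j} is (numerically) zero in every cluster k."""
    p = len(mus[0])
    return [j for j in range(p) if all(abs(mu[j]) <= tol for mu in mus)]

# Toy example with K = 2 clusters and p = 3 expansion coefficients.
mus = [[1.5, 0.0, -2.0], [-1.0, 0.0, 0.0]]
W = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.0], [0.0, 0.0, 1.0]]
precisions = [W, W]

print(l1_penalty(mus, precisions, lam1=0.5, lam2=5.0))
print(uninformative_components(mus))
```

In this toy example only the second component has zero mean in both clusters, so it is the only one flagged as not useful for separating the groups.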


2.2 Model Estimation via E-M Algorithm 


Since the membership of each observation to a cluster is unobservable, data related 
to the grouping variable Z is inevitably missing and the maximum penalized log- 
likelihood estimator can be obtained by means of the E-M algorithm [4], that iterates 
over two steps: expectation (E) of the complete data (penalized) log-likelihood by 
considering the unknown parameters equal to those obtained at the previous iteration 
(with initialization values), and maximization (M) of a lower bound of the obtained
expected value with respect to the unknown parameters. 

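In particular, at the d-th iteration, given a current estimate $\theta^{(d)}$, the lower bound
after the E-step assumes the following form:

$$Q_P(\theta; \theta^{(d)}) = \sum_{k=1}^{K} \sum_{i=1}^{n} \tau_{k,i}^{(d)} \left[\log \pi_k + \log f(\beta_i; \mu_k, \Sigma_k)\right] - \lambda_1 \sum_{k=1}^{K} \|\mu_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l=1}^{p} |W_{k;j,l}|,$$

where $\tau_{k,i}^{(d)} = P(Z_k = 1 \mid X = x_i)$, computed under $\theta^{(d)}$, is the posterior probability of observation $i$ to belong
to group $k$. The M-step maximizes the function $Q_P$ in order to update the estimate
of $\theta$.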
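
The E-step amounts to applying Bayes' rule to the current parameter values. A minimal sketch is given below, assuming the projection coefficients are stored row-wise in an n x p array; the function names and the toy data are illustrative only, and the computation is carried out on the log scale for numerical stability.

```python
import numpy as np

def log_gaussian(beta, mu, sigma):
    """Row-wise log-density of N(mu, sigma) evaluated at the rows of beta (n x p)."""
    p = mu.size
    diff = beta - mu
    _, logdet = np.linalg.slogdet(sigma)
    # Mahalanobis distances (beta_i - mu)' sigma^{-1} (beta_i - mu) for every row i
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def e_step(beta, pis, mus, sigmas):
    """Posterior probabilities tau[k, i] that curve i belongs to cluster k."""
    log_w = np.stack([np.log(pi) + log_gaussian(beta, mu, sig)
                      for pi, mu, sig in zip(pis, mus, sigmas)])
    log_w -= log_w.max(axis=0)      # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=0)        # normalise over clusters

# Toy example: two well-separated clusters in p = 2 dimensions.
beta = np.array([[0.0, 0.1], [5.0, 4.9]])
pis = [0.5, 0.5]
mus = [np.zeros(2), np.full(2, 5.0)]
sigmas = [np.eye(2), np.eye(2)]

tau = e_step(beta, pis, mus, sigmas)
print(np.round(tau, 3))
```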

As suggested by [9], it is possible to maximize each of the $K$ terms using
a “graphical lasso” (GLASSO) algorithm (first proposed by [5]), thanks
to the close connection between fitting Gaussian mixture models and Gaussian
graphical models. Indeed, in GLASSO the objective function looks like
$\log\det(W) - \mathrm{tr}(SW) - \lambda \sum_{j,l=1}^{p} |W_{j,l}|$, so that the algorithm implemented in the R
package “glasso” can be used with $W = W_k$, $S = S_k$ and $\lambda = 2\lambda_2 / \sum_{i=1}^{n} \tau_{k,i}$ for each $k$
to obtain the elements $W_{k;j,l}$ of the precision matrices.
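
The link between the M-step for the precision matrices and the GLASSO objective can be made explicit with a short derivation. The sketch below assumes that $S_k$ denotes the posterior-weighted sample covariance matrix of cluster $k$ and writes $n_k$ for the sum of the posterior probabilities; both symbols are introduced here for illustration and are not defined in the text above.

```latex
% Terms of Q_P that involve W_k = \Sigma_k^{-1} (additive constants dropped),
% with n_k = \sum_{i=1}^{n} \tau_{k,i} and
% S_k = n_k^{-1} \sum_{i=1}^{n} \tau_{k,i} (\beta_i - \mu_k)(\beta_i - \mu_k)'   (assumed definition)
\begin{align*}
Q_P^{(k)}(W_k)
  &= \sum_{i=1}^{n} \tau_{k,i}
     \left[ \tfrac{1}{2} \log\det W_k
          - \tfrac{1}{2} (\beta_i - \mu_k)' W_k (\beta_i - \mu_k) \right]
     - \lambda_2 \sum_{j,l=1}^{p} |W_{k;j,l}| \\
  &= \frac{n_k}{2} \left[ \log\det W_k - \mathrm{tr}(S_k W_k) \right]
     - \lambda_2 \sum_{j,l=1}^{p} |W_{k;j,l}| \\
  &= \frac{n_k}{2} \left[ \log\det W_k - \mathrm{tr}(S_k W_k)
     - \frac{2 \lambda_2}{n_k} \sum_{j,l=1}^{p} |W_{k;j,l}| \right].
\end{align*}
```

Since $n_k / 2 > 0$, maximizing the last expression over $W_k$ is equivalent to maximizing the GLASSO objective with penalty $\lambda = 2\lambda_2 / n_k$, which is why the GLASSO penalty depends linearly on $\lambda_2$ and is rescaled by the effective size of each cluster.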


2.3 Model Selection via Silhouette Profile 


A fundamental, and probably unsolved, problem in cluster analysis is determining
the "true" number of groups in a dataset. For simplicity, here we
treat the choice of the number of groups as a cluster validation problem
and use the average silhouette width index as a model selection heuristic. The
silhouette value for curve $i$ is given by

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$


where $a(i)$ is the average distance of curve $i$ to all other curves $h$ assigned to the
same cluster (if $i$ is the only observation in its cluster, then $s(i) = 0$), and $b(i)$ is
the minimum average distance of curve $i$ to observations $h$ which are assigned to
a different cluster. This definition ensures that $s(i)$ takes values in $[-1, 1]$, where
values close to one indicate “better” clustering solutions. Conditional on $K$ and a pair
of values $(\lambda_1, \lambda_2)$, we thus assess the overall cluster solution using the total average
of silhouette values

$$S(K, \lambda_1, \lambda_2) = \frac{1}{n} \sum_{i=1}^{n} s(i).$$

In particular, by doing a grid search for the triple $(K, \lambda_1, \lambda_2)$, the best cluster
solution is obtained by looking for the largest value of the average silhouette width
(ASW) index. Note that, to evaluate $s(i)$, $i = 1, \ldots, n$, and then the objective function
$S(K, \lambda_1, \lambda_2)$, we need to compute a distance between pairs of curves $x_i$ and $x_h$. One
possibility is to compute the Euclidean distance

$$d_E(i, h) = \sqrt{\int \left[x_i(t) - x_h(t)\right]^2 dt}.$$
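
The average silhouette width can be computed directly from a matrix of pairwise distances and a vector of cluster labels. The sketch below is a plain illustration of the definitions above, not the authors' code: the distance between curves is approximated by a Riemann sum on a regular grid (an assumption), the toy curves are hypothetical, and at least two clusters are required.

```python
import math

def l2_distance(x, y, dt):
    """Riemann-sum approximation of the L2 distance between two curves on a regular grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) * dt)

def silhouette_values(dist, labels):
    """Silhouette value s(i) for each curve, given pairwise distances and cluster labels."""
    n = len(labels)
    clusters = sorted(set(labels))
    values = []
    for i in range(n):
        own = [h for h in range(n) if labels[h] == labels[i] and h != i]
        if not own:                       # singleton cluster: s(i) = 0
            values.append(0.0)
            continue
        a = sum(dist[i][h] for h in own) / len(own)
        b = min(
            sum(dist[i][h] for h in range(n) if labels[h] == c)
            / sum(1 for h in range(n) if labels[h] == c)
            for c in clusters if c != labels[i]
        )
        values.append((b - a) / max(a, b))
    return values

def average_silhouette_width(dist, labels):
    s = silhouette_values(dist, labels)
    return sum(s) / len(s)

# Toy example: four constant curves on a grid of 50 points over [0, 1).
levels = [0.0, 0.1, 1.0, 1.1]
curves = [[lvl] * 50 for lvl in levels]
dist = [[l2_distance(x, y, dt=1 / 50) for y in curves] for x in curves]

asw_good = average_silhouette_width(dist, [0, 0, 1, 1])
asw_bad = average_silhouette_width(dist, [0, 1, 0, 1])
print(round(asw_good, 3), round(asw_bad, 3))
```

In a grid search, the same function would be evaluated on the partition returned for each candidate triple $(K, \lambda_1, \lambda_2)$, keeping the triple with the largest value.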


3 Experimental Results 
3.1 Simulation 


We present here a simulated scenario in order to investigate the effectiveness of
the $L_1$ regularization in removing noise while preserving dominant local features,
accommodating the spatial heterogeneity of the curves.

The statistical analysis is illustrated for data simulated by means of a finite mixture 
of multivariate Gaussian distributions. In particular, based on equations (2.1) and
(2.2), the curves are simulated using a combination of $p = 25$ Fourier basis functions
defined over a one-dimensional regular grid with 100 observations. We consider a 
mixture of four (K = 4) multivariate Gaussian distributions with isotropic covariance 
matrices, i.e. 


$$\beta_k \sim N(\mu_k, \Sigma_k) \quad \text{where } \epsilon_j \sim N(0, 0.5), \qquad k = 1, \ldots, 4.$$


With the exception of 3 entries per group, the entries of the mean vectors $\mu_k$ are all zero.
Under this scenario, the simulated curves (25 per group) and the non-zero group
expansion coefficients are represented in Figure 1. For this simple simulation setting,
estimation results suggest that, using the Euclidean distance to compute the ASW, the
grid search procedure is always able to correctly select the cluster-relevant basis
functions. This is confirmed by Figure 2, which shows both the distribution (over 100
replications) of the selected basis functions and the data projected on these bases, which
clearly highlight the identification of 4 clusters. Under this scenario, the quality of
the estimated clusters thus appears very good, as the analysis of the misclassification
rate suggests 100% accuracy in all the replicated datasets.

Similar results hold for more complex simulation designs, where we consider
different structures of the covariance matrices in the data generating process.


3.2 Performance on Real Data Sets 


We evaluate the PFC-$L_1$ model on a well-known benchmark data set, namely the
electrocardiogram (ECG) data set (data can be found at the UCR Time Series
Classification Archive [3]).

The ECG data set comprises a set of 200 electrocardiograms from 2 groups of
patients, myocardial infarction and healthy, sampled at 96 time instants.
Fig. 1 Left: 25 simulated curves for each group. Right: Vector of expansion coefficients for each
group, with only three non-zero coefficients corresponding to basis functions with specific period- 
icities (Hertz values). 
Fig. 2 Left: Data projected on cluster specific functional subspace generated by the selected basis
functions. Right: Distribution (over 100 replications) of the selected basis functions shown for pairs 
of sine and cosine basis functions, according to the Hertz values. 


This data set was previously used to compare the performance of several functional
clustering models in [1]. The results in Table 5 of [1] show that the FunFEM
models, compared to other state-of-the-art methodologies, achieved the best
performance in terms of accuracy. Hence, here, we limit the comparison to the results
obtained with the PFC-$L_1$ and the FunFEM models. Although FunFEM models rely
on a mixture of Gaussian distributions describing the likelihood of the data, similarly
to our proposal, they differ in how they handle the intrinsic high dimension of the problem:
they estimate a latent discriminant subspace in parallel with the steps of an EM
algorithm.

For all the data, we reconstruct the functional form from the sampled curves
using an arbitrarily chosen set of 20 cubic spline basis functions. We tested the PFC-$L_1$ models
considering five different values for the number of clusters, $K \in \{2, 3, 4, 5, 6\}$, and
six values for $\lambda_1 \in \{0.5, 1, 5, 10, 15, 20\}$.

Considering that the GLASSO penalty parameter $\lambda$ depends linearly on $\lambda_2$,
the choice of $\lambda_2$ has to provide suitable values for $\lambda$. A practical approach is to
choose values avoiding convergence problems with GLASSO. Here $\lambda_2$ was chosen from
$\{5, 7.5, 10, 12, 15, 20\}$ for the ECG data. Both PFC-$L_1$ and FunFEM algorithms were
initialized using a K-means procedure.
The clustering accuracies, computed with respect to the known labels, are 69% for
FunFEM DFM$_{[\alpha_{kj}\beta_k]}$ (chosen among 12 different model parameterizations with the
BIC index), and 75% for PFC-$L_1$ [$\lambda_1 = 0.5$, $\lambda_2 = 5$] (values of the tuning parameters
chosen by the ASW index). Thus PFC-$L_1$ achieves good performance, with a relative increase
in accuracy of about 9% (6 percentage points).


4 Discussion 


In this paper we investigated the potential of shrinkage methods for clustering
functional data. Our numerical examples show the advantages of performing
clustering with feature selection, such as uncovering interesting structures underlying the
data while preserving good clustering accuracy. To the best of our knowledge, this is
the first proposal that considers a penalty for both means and covariances of mixture
components in functional model-based clustering. In the model selection section we
defined a heuristic criterion to choose among different model parameterizations
based on the average silhouette index. It may be interesting to evaluate different
distances (i.e. not Euclidean) to compute this index in future research. Moreover, we
will consider more complex simulation designs to investigate the robustness of the
proposal and extend the comparison with state-of-the-art methodologies to more
benchmark datasets.
Emotion Classification Based on Single
Electrode Brain Data: Applications for Assistive 
Technology 


Duarte Rodrigues, Luis Paulo Reis, and Brígida Mónica Faria 


Abstract This research focused on the development of an emotion classification
system intended to be integrated in projects committed to improving assistive
technologies. An experimental protocol was designed to acquire an electroencephalogram
(EEG) signal that reflected a certain emotional state. To trigger this stimulus, a set
of clips was retrieved from an extensive database of pre-labeled videos. Then, the
signals were processed in order to extract valuable features and patterns
to train the machine and deep learning models. Three classification hypotheses
were proposed: recognition of 6 core emotions; distinguishing between 2 different
emotions; and recognising whether the individual was being directly stimulated or merely
processing the emotion. Results showed that the first classification task was a
challenging one, because of the limited sample size. Nevertheless, good results were
achieved in the second and third case scenarios (70% and 97% accuracy scores,
respectively) through the application of a recurrent neural network.


Keywords: emotions, brain-computer interface, EEG, supervised learning, machine 
1 Introduction


Emotions are a part of our lives. As humans, we know how to identify the tiniest
of microexpressions to unveil what someone is feeling, but also how to use them 
to express our hearts. From the youngest of ages we see and interact with others 
and build a database of patterns of, for example, what joy is and how different it is 
from fear or sadness. Computers, on the other hand, do not have any idea of what an 
emotion is or how to recognize it. Or do they? 

The Artificial Intelligence and Computer Science Laboratory (LIACC) estab- 
lished 2 projects where emotion recognition can be of the utmost importance. The 
first project, the "Intel Wheels 2.0" [1], intends to develop an interactive and in- 
telligent electric wheelchair. This innovative equipment will have a diverse set of 
features, such as an adaptive control system (through eye gaze, a brain-computer 
interface, hand orientation, among others) and a personalized multi-modal interface 
which will allow communication to multiple devices both from the patients and the 
caregivers. In this case, having information about the mood of the patient is very 
beneficial, because the interface can give updates to the nursing staff of the emotional 
condition of the patient. The second project, the "Sleep at the Wheel" [2], focuses on 
the research of an interface that can sense and predict a driver's drowsiness state, be- 
ing able to detect if he fell asleep while driving and, consequently, support an alarm 
system to provide safer routing and driving. Here the state of mind of the driver 
is a very important aspect, as different emotions, like anger or fear, can provoke 
dangerous situations or unpredictable scenarios, making the driver less attentive to 
his surroundings. 

In this work, emotions will be sensed through a brain-computer interface (BCI).
BCIs of this kind are commercial devices that allow the acquisition of a surface
electroencephalogram (EEG). This signal measures the electrical activity of the brain, which
fluctuates according to the firing of the neurons, and is quantified in
microvolts. In this research, the BCI used was the "NeuroSky MindWave2" which
possesses one single electrode on the forehead, from which it collects a signal from 
the activity of the frontal lobe. This brain area is responsible for the higher executive 
functions, including emotional regulation, planning, reasoning and problem solving 
[3]. 

The study of emotion recognition started with psychologist Paul Ekman, who
defined, based on a cross-cultural study, six core emotions - Fear, Anger, Happiness,
Sadness, Surprise and Disgust [4]. Later, psychologist Robert Plutchik established a 
model called "Wheel of Emotions", a diagram where every emotion can be derived 
from the core 6. 

It is also important to have a way to measure what someone is feeling or what 
emotion they are experiencing. An easy way to do this is through the "Discrete Emotion
Questionnaire", a psychologically validated questionnaire to verify the intensity
of a certain emotion. This assessment presents the 6 core emotions to the subjects 
asking them to rate the intensity they felt, from 1 to 7 [5]. 

As a first approach in this area, the current work aims to be able to identify the 
core emotions using EEG signals collected with the BCI. 
2 Experimental Methodology


In order to correctly identify the core emotions, the first step is to trigger them in 
an efficient way for the brain data collected to be as informative as possible. To do
so, the emotions were prompted via a set of video clips that lasted 5-7 seconds.
These videos were selected from a certified database, where the videos were labeled
according to the intensity and kind of emotion they caused in the subjects [6]. For each
of the 6 core emotions, the 4 videos classified with the biggest intensity were selected 
to be presented to the participants of this research work. 

For each of the 24 video clips (4 videos per each of the 6 emotions), 3 EEG 
samples are collected. The first is before the display of the video, where a fixation 
cross is presented, in order to collect the idle/blank state of the user, where he 
is asked to relax. The second sample is the EEG during the video (active visual 
stimulus); and the third sample is after the video finishes where the volunteer is 
processing the emotion triggered (higher level thinking), while getting back to the 
initial relaxed state, where the fixation cross is presented again. To confirm that the 
volunteers experience the same emotion defined in the pre-determined label, they
are prompted to answer the "Discrete Emotion Questionnaire" after the 3 EEG
samples are collected.

Regarding the physiological signal processing, this step is important because the
raw EEG signal that comes directly from the BCI has a low signal-to-noise ratio,
as well as many surrounding artifacts that contaminate the readings, especially eye
blinks and facial movements triggered by the various emotions. The interfering
signals caused by the latter, called electromyograms (EMG), are characterized
by high frequencies (50-150 Hz) that make the underlying signal very noisy. Every
time a person blinks, the EEG signal shows a very high peak with a very low
frequency (< 1 Hz). To remove these artifacts, a 5th order Butterworth bandpass
filter with cut-off frequencies at 1 Hz and 50 Hz was applied [7]; this type of filter
was chosen because it has the flattest frequency response, which leads to less
signal distortion. The attenuation of very low frequencies is important to remove the eye blink
artifacts. As for the upper cut-off frequency, 50 Hz is a convenient choice
since it mitigates the effects of the power line noise and the EMG artifacts. In
this way, no important brain data is lost. At this step, the EEG was segmented into the
brain waves of interest, i.e., the alpha and beta brain waves. This is done by
applying bandpass filters (same filter type as before) in the corresponding
bandwidths, 8-13 Hz and 13-32 Hz, to obtain the alpha and beta bands, respectively.
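
The filtering stage can be sketched with SciPy as follows. This is an illustrative example rather than the pipeline actually used: the sampling rate of 512 Hz, the zero-phase (forward-backward) filtering and the synthetic test signal are all assumptions made here, since the text only specifies the filter type, order and cut-off frequencies.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 512.0  # assumed sampling rate in Hz (not stated in the text)

def bandpass(signal, low_hz, high_hz, fs=FS, order=5):
    """Zero-phase Butterworth band-pass filter, implemented with second-order sections."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Synthetic 4-second signal standing in for a raw EEG recording.
t = np.arange(0, 4.0, 1.0 / FS)
raw = (np.sin(2 * np.pi * 10 * t)             # component in the alpha range
       + np.sin(2 * np.pi * 20 * t)           # component in the beta range
       + 2.0 * np.sin(2 * np.pi * 0.3 * t)    # slow drift, standing in for eye blinks
       + 0.5 * np.sin(2 * np.pi * 100 * t))   # fast component, standing in for EMG

eeg = bandpass(raw, 1.0, 50.0)     # cleaned EEG (1-50 Hz)
alpha = bandpass(eeg, 8.0, 13.0)   # alpha band (8-13 Hz)
beta = bandpass(eeg, 13.0, 32.0)   # beta band (13-32 Hz)
```

Second-order sections are used because high-order band-pass designs with a low cut-off close to zero can be numerically fragile in transfer-function form; forward-backward filtering avoids phase distortion at the cost of a steeper effective roll-off than a single pass.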

At this stage, the EEG signals have the "emotional data" exposed, allowing
the features to be extracted. To do so, multiple mathematical equations were applied to
obtain relevant information from the signals. Feature extraction methods depend
on the domain, as will be seen ahead [8]. Most strategies to extract features from
the EEG are formulas applied in the time domain, such as the common statistical
equations, the Hjorth statistical parameters, and the mean and zero crossings (number of
times the signal crosses these 2 thresholds) [8]. Besides these, more advanced
feature extraction methods were applied, based on fractal dimensions and entropy
analysis (methods to assess the complexity, or irregularity, of a time series) [9].
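
Two of the time-domain features mentioned above can be illustrated with a short sketch. It uses the usual definitions of the Hjorth parameters (activity, mobility and complexity, computed from the variances of the signal and of its successive differences) and counts crossings of a given level; the function names and the test signal are hypothetical, and the exact variants used in this work are not specified in the text.

```python
import math
from statistics import mean, pvariance

def hjorth_parameters(x):
    """Hjorth activity, mobility and complexity of a 1-D signal."""
    dx = [b - a for a, b in zip(x, x[1:])]       # first difference
    ddx = [b - a for a, b in zip(dx, dx[1:])]    # second difference
    activity = pvariance(x)
    mobility = math.sqrt(pvariance(dx) / activity)
    complexity = math.sqrt(pvariance(ddx) / pvariance(dx)) / mobility
    return activity, mobility, complexity

def crossings(x, level=0.0):
    """Number of times the signal crosses the given level between consecutive samples."""
    centred = [v - level for v in x]
    return sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)

# Test signal: 1 second of a 5 Hz sine wave sampled at 200 Hz (half-sample offset).
x = [math.sin(2 * math.pi * 5 * (n + 0.5) / 200) for n in range(200)]

activity, mobility, complexity = hjorth_parameters(x)
zero_crossings = crossings(x, level=0.0)
mean_crossings = crossings(x, level=mean(x))
print(round(activity, 3), round(mobility, 3), round(complexity, 3))
print(zero_crossings, mean_crossings)
```

For a pure sine wave the complexity is close to one, which is a convenient sanity check; more irregular signals yield larger values.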
Regarding frequency domain approaches, these features can only be calculated on
the filtered EEG and not on the brain waves, as their spectrum is very narrow. In
terms of the pure frequency band, the only feature computed was the Power Spectral
Density (PSD), based on the Welch method. These domains can be combined, creating
the time-frequency domain, leading to more sophisticated methods, like the
Hilbert-Huang Transform, where the original signal is decomposed into intrinsic mode
functions (IMF) [10].

The resulting number of features is too high to train machine learning models,
because the correlation between most of the features is very low, which means that
between different classes the information is virtually the same. This would introduce
uncertainty in the weights for each class in the models; thus, the number of features
needs to be reduced. To do this, the "Min Redundancy Max Relevance" (MRMR)
method was applied, with the objective of finding the optimal number of features
to obtain a higher inter-class variability and so find distinct patterns between
emotions [11]. The features were used raw, normalized, or standardized to train the
models.
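
The three feature representations can be sketched as follows. It is assumed here that "normalized" means min-max scaling to [0, 1] and "standardized" means z-scoring per feature; the text does not state the exact definitions, and the small matrix is hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant columns are left at 0
    return (X - lo) / span

def standardize(X):
    """Centre each feature to zero mean and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # constant columns are left at 0
    return (X - mu) / sigma

# Hypothetical feature matrix: 3 recordings x 2 features on very different scales.
X_raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm = min_max_normalize(X_raw)
X_std = standardize(X_raw)
```

In a train/test setting, the scaling statistics would usually be estimated on the training split only and then applied to the test split.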

In this study, all the models implemented are based on supervised learning and
fully depend on the data provided as input. For emotion classification there is
no single machine learning approach that is optimal; thus, 9 different types of
models were implemented to verify which has the best performance. These models
are designed to adapt to various kinds of input data, through the definition
of hyper-parameters. Hence, to tune them to the best possible configuration, a
GridSearchCV was performed. This method exhaustively searches over a given list of
possible parameters, applying cross-validation to each combination. In the end, the
model with the best performance is chosen to be trained with the resulting feature
matrix.
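
A minimal sketch of such a search with scikit-learn's GridSearchCV is given below. The synthetic data, the choice of a support vector classifier, the parameter grid and the 5-fold setting are all assumptions for illustration; the grids used in the study are not reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a feature matrix with 30 selected features.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hypothetical grid: every combination is scored by 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_model = search.best_estimator_           # refitted on the whole training split
test_accuracy = best_model.score(X_test, y_test)
```

The held-out test split is only touched once, after the search, so the reported accuracy is not biased by the hyper-parameter selection.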

A deep learning model was also implemented, based on a recurrent neural network
(RNN), a very common architecture in classification problems using EEG. A
particularity of this network is that it has a gated recurrent unit (GRU) layer,
i.e., a layer that helps to mitigate the problem of vanishing gradients (a common
issue in artificial neural networks), giving long-term memory to the model [12].
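
To make the role of the gates concrete, the sketch below runs a single GRU cell over a short sequence. The weights are random and the dimensions are hypothetical; it shows the standard GRU update equations, not the network trained in this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU update; params holds hypothetical weight matrices and biases."""
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev + params["bz"])  # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev + params["br"])  # reset gate
    h_cand = np.tanh(params["Wh"] @ x + params["Uh"] @ (r * h_prev) + params["bh"])
    # When z is close to 0 the previous state is carried over almost unchanged,
    # which is what lets information (and gradients) survive over many steps.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3            # assumed input and hidden sizes
params = {}
for gate in "zrh":
    params["W" + gate] = rng.normal(scale=0.5, size=(n_hidden, n_in))
    params["U" + gate] = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    params["b" + gate] = np.zeros(n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_in)):   # a sequence of 10 feature vectors
    h = gru_step(x_t, h, params)
```

Some references swap the roles of z and 1 - z in the last line; both conventions describe the same family of models.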


3 Evaluation and Discussion of Results 


In this experiment, 12 subjects volunteered to participate. Each EEG recording is
labeled according to the emotion registered in the original database, as well as
whether it was recorded before, during or after the video. The answers to the
"Discrete Emotion Questionnaire" were used to validate whether the emotion triggered
by the video was as expected and, if so, the data was used. With this dataset
structure, 3 hypotheses were tested and their results are discussed ahead.

An important aspect to consider is that the EEG collected while the
subject is relaxing, i.e., while the fixation cross is presented before the video,
does not have relevant cognitive information regarding emotions. Therefore, these
segments were not considered to train any of the models.


3.1 Core Emotions Classification


This first hypothesis describes the main goal of the project, where a model was
developed to classify 6 emotions.

First, the feature extraction was computed. At this step, the optimal number of
features to select was tested, iterating from 5 to 50 in steps of 5. The best
number found was 30, which gave the best accuracies with a balanced computation
time and power. This value was chosen for the 3 feature matrices (raw, normalized
and standardized). The dataset was then divided into training and testing sets with
an 80%/20% split, fully independent of one another. Each model was then trained and
assessed by computing the accuracy on the test dataset. Table 1 presents the results
for each model.


Table 1 Results of the 6 Core Emotions Classification.


                                                 Accuracy (%)
Classification Models               Raw Features  Normalized Features  Standardized Features

Gaussian Naive Bayes Classifier     12.07         12.93                10.34
Support Vector Classifier           12.07         12.93                16.38
Decision Tree Classifier            18.96         18.10                18.10
Random Forest Classifier            24.13         18.10                20.69
K Nearest Neighbors                 21.55         18.96                16.38
Logistic Regression                 25.00         14.66                18.10
Linear Discriminant Analysis        24.13         14.65                18.96
Linear Support Vector Classifier    18.10         13.79                19.82
Multi-Layer Perceptron              20.69         13.79                12.93
Recurrent Neural Network            13.79         20.69                23.27


When comparing the various models, the average accuracy is around 16-18%, which
is consistent with the chance level for the number of classes in the problem
(100%/6 ≈ 16.7%). Despite this, the best result reached was 25% accuracy, with the
features in their raw state: since the magnitude information was not lost, patterns
in different emotions could be more easily identified due to the high discrepancy
in the values. These results are not discouraging, since the main objective of the
study is very ambitious: we are trying to create a model to define universally what
an emotion is. There is no work more subjective or abstract, and the only way to
achieve this universal standardization would be with a sample population as wide
and diverse as possible, with different beliefs, nationalities, age groups, etc.
Although this is an initial study, it shows that it is possible to register and
identify differences in the electrical changes of the prefrontal cortex and, with
that information, categorize what someone is feeling.


3.2 One vs One - Dual Emotion Classification


As the models in the previous hypothesis could not precisely identify an emotion when
compared to the other 5, the problem was narrowed down and a new hypothesis was
tested, to continue the proposed research. In this experiment, the model was trained
to discern between only 2 emotions, decided a priori. For demonstration purposes,
a concrete example can be seen in Table 2, which compares "fear" vs "surprise".


Table 2 Results of "Fear vs Surprise" Classification.


                                                 Accuracy (%)
Classification Models               Raw Features  Normalized Features  Standardized Features

Gaussian Naive Bayes Classifier     48.27         55.17                53.44
Support Vector Classifier           51.72         51.72                53.44
Decision Tree Classifier            56.89         50.00                44.83
Random Forest Classifier            48.27         50.00                60.34
K Nearest Neighbors                 46.55         44.82                50.00
Logistic Regression                 50.00         53.45                53.45
Linear Discriminant Analysis        50.00         48.28                53.44
Linear Support Vector Classifier    50.00         51.72                55.17
Multi-Layer Perceptron              50.00         50.00                58.62
Recurrent Neural Network            69.23         51.23                56.21


In this case, most of the machine learning algorithms have accuracies on the
order of 50-53%. These results are not ideal, as they are no better than a random
choice between the two classes; however, this can be justified by the small
population sample, which is not large enough to bring to the surface concrete
patterns in the features. Regarding the deep learning approach, the RNN has an
advantage in this case, giving a final accuracy of 69%. This result shows that this
model is reliable, and that in the majority of the cases the 2 emotions can be
distinguished. In this particular case, the facial expressions and their muscle
activity can induce large artifacts in the EEG. Someone who feels surprised has the
tendency to raise their eyebrows and open their mouth. These movements can lead to
a difference in the EEG and, consequently, in the patterns of the features, making
the distinction between surprise and fear more noticeable. The same reasoning applies
to other emotions that trigger facial movement, such as laughing or frowning.


3.3 Stimulus vs No Stimulus Classification 


In addition to the good results presented for the previous hypothesis, one last
hypothesis was assessed, regarding the difference between experiencing the emotion
while watching the video (direct stimulus), and after, when the fixation cross is
presented, while the volunteer is simply thinking and cognitively processing the
emotion. Table 3 summarizes the results of the various models.


Table 3 Results of Stimulus vs No Stimulus classification.


                                                 Accuracy (%)
Classification Models               Raw Features  Normalized Features  Standardized Features

Gaussian Naive Bayes Classifier     61.20         58.62                85.34
Support Vector Classifier           58.62         58.62                91.37
Decision Tree Classifier            39.65         58.62                89.65
Random Forest Classifier            39.65         58.62                91.37
K Nearest Neighbors                 37.93         58.62                89.65
Logistic Regression                 34.48         58.62                87.06
Linear Discriminant Analysis        29.31         37.06                80.17
Linear Support Vector Classifier    34.48         58.62                87.06
Multi-Layer Perceptron              31.03         58.62                88.79
Recurrent Neural Network            96.55         61.20                88.79


As can be seen, for this experiment, most models did fairly well using the
standardized features, with all accuracies higher than 80%. However, when testing
the deep learning approach, this architecture fit the testing data almost
perfectly, with an accuracy higher than 96% on the raw features. This hypothesis is
the proof of concept that the characteristics of the signal collected during the
stimulus itself are very different from the ones from a signal obtained when the
person is simply thinking and cognitively processing the emotion (this change would
be obvious if the EEG were collected from the occipital lobe, which is responsible
for visual perception, but is remarkable when spotted in the prefrontal cortex).


4 Conclusions 


In conclusion, as a first approach, the results achieved are very satisfactory and
reveal a high potential to be greatly efficient in the proposed applications, both in
the "IntellWheels2.0" and "Sleep at the Wheel" projects. Nevertheless, by collecting
more data, the models should generalize better, resulting in more realistic patterns
and, consequently, higher prediction accuracies.

Compared to the literature, using simple visual stimuli to distinguish six
emotions, in a relaxed state, is a novel tactic. Most studies complement the stimulus
with forced facial expressions, introducing different characteristics to the signal
and leading to better results. Other studies use BCIs with more electrodes
(channels), covering a wider cranial surface and, consequently, capturing more EEG
information, which leads to more robust results.

As future work, the preprocessing of the data could be polished, improving the
removal of artifacts and enhancing the underlying information of the EEGs. To obtain
better results, a transfer learning approach could also be used, by pre-training the
models with other emotion-related EEG databases.


The Death Process in Italy Before and During
the Covid-19 Pandemic: a Functional
Compositional Approach


Riccardo Scimone, Alessandra Menafoglio, Laura M. Sangalli, and Piercesare 
Secchi 


Abstract In this talk, based on [1], we propose a spatio-temporal analysis of daily
death counts in Italy, collected by ISTAT (Italian Statistical Institute), in Italian
provinces and municipalities. While in [1] the focus was on the elderly class (70+
years old), we here focus on the middle age class (50-69 years old), carrying out
analogous analyses and comparative observations. We analyse historical provincial
data starting from 2011 up to 2020, the year in which the impacts of the Covid-19
pandemic on the overall death process are assessed and analysed. The cornerstone of
our analysis pipeline is a novel functional compositional representation for the death
counts during each calendar year: specifically, we work with mortality densities over
the calendar year, embedding them in the Bayes space $B^2$ of probability density
functions. This Hilbert space embedding allows for the formulation of functional
linear models, which are used to split each yearly realization of the mortality density
process into a predictable and an unpredictable component, based on the mortality
in previous years. The unpredictable components of the mortality density are then
spatially analysed in the framework of Object Oriented Spatial Statistics. Via
spatial downscaling of the results obtained at the provincial level, we obtain smooth
predictions at the fine scale of Italian municipalities; this also enables us to
perform anomaly detection, identifying municipalities which behave unusually with
respect to the surroundings.


Riccardo Scimone (✉)
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and
Society, Human Technopole, Milano, Italy, e-mail: riccardo.scimone@polimi.it


Alessandra Menafoglio
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy,
e-mail: alessandra.menafoglio@polimi.it


Laura M. Sangalli
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy,
e-mail: laura.sangalli@polimi.it


Piercesare Secchi
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and
Society, Human Technopole, Milano, Italy, e-mail: piercesare.secchi@polimi.it


© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_36


Keywords: COVID-19, O2S2, functional data analysis, spatial downscaling 


1 Introduction and Data Presentation 


At the dawn of the third year of global pandemic, we can affirm that no aspect of
people's everyday life has been left untouched by the consequences of Covid-19.
The virus, in addition to exacting a heavy death toll, has caused great upheavals
in the global economy, education systems, technological development and in countless
other aspects of human life. Given this global reach, we deem it appropriate to
analyse death counts from all causes, and not just those directly attributed to
Covid-19, as a proxy of how Italian administrative units, be they municipalities or
provinces, have been affected by the pandemic. This choice is driven by the following
considerations:


* Death counts from all causes are, on many levels, high quality data: they have a 
very fine spatial and temporal granularity, being collected daily in each Italian 
municipality, they are finely stratified in many age classes, and they are not affected 
by errors due to incorrect attribution of the cause of death, as may happen, for 
example, in deciding whether or not a given death is due to Covid-19; 

* They incorporate any possible shock, be it direct or indirect, which the natural
death process underwent: fewer deaths from road accidents due to restrictive
policies, more deaths from other pathologies which are left untreated because of the
unnatural stress on the welfare systems, and so on;

* They are made freely available by ISTAT, with a substantial amount of historical
data; in particular, in the following analysis we consider data starting from the
beginning of 2011 up to the end of 2020.


The purpose of the analysis of such data is twofold: (1) to study the correlation
structure of the death process in Italy before and during the pandemic, assessing
possible perturbations caused by its outbreak, and (2) to assess local anomalies at
the municipality level (i.e., identifying municipalities which behave unusually with
respect to the surroundings). This talk will be entirely devoted to presenting data
and results concerning people aged between 50 and 69 years. The elderly class was the
focus of [1], while analyses focusing on younger age classes can be freely examined
at https://github.com/RiccardoScimone/Mortality-densities-italy-analysis.git.

Daily death counts for the 107 Italian provinces, in the time interval spanning
from 2017 to 2020, are shown in Fig. 1: for each province, we draw death counts
along the year in light blue. The black solid line is the weighted mean number of
deaths, where each province has a weight proportional to its population. We also
highlight four provinces with colours: Rome, Milan, Naples, and Bergamo. By
visual inspection, it is easy to see that, during the years 2017, 2018 and 2019, the
mortality in this age class has an almost uniform behaviour, with only a very slight
increase in deaths during winter for some provinces. Conversely, 2020 presents
an abnormal behaviour in many provinces, due to the pandemic outbreak: look for
example at the double peak for Milan, hit by both pandemic waves, or the single,
dramatically sharp peak of Bergamo, which reached, during the first wave, higher
death counts than the ones associated with provinces which are several times bigger,
such as Rome or Naples. By comparison with the plots in [1], one can see how all
these peaks are less sharp than for the elderly class: this is perfectly reasonable,
since people aged more than 70 years are much more susceptible to death by Covid-19.


Daily death counts, Italian provinces, 50-69 years
Fig. 1 Daily death counts during the last four years, for the Italian provinces. The plots refer to
people aged between 50 and 69 years. For each province, death counts along the year are plotted in
light blue: curves are overlaid one on top of the other to visualize their variability. The black solid
line is the weighted mean number of deaths, where each province has a weight proportional to its
population, while some selected provinces are highlighted in colour.


To set some notation, we denote the available death counts data as $d_{iyt}$, where
$i$ is a geographical index, identifying provinces or municipalities, $y$ is the year
and $t$ is the day within year $y$. Moreover, we denote by $T_{iy}$ the absolutely
continuous random variable time of death along the calendar year, which models the
instant of death of a person living in area $i$ and passing away during year $y$. We
hence consider the empirical discrete probability density of this random variable,


$$p_{iyt} = \frac{d_{iyt}}{\sum_{t=1}^{365} d_{iyt}} \quad \text{for } t = 1, \ldots, 365$$


for each area $i$ and year $y$. The family $\{p_{iy}\}_{iy}$ is the main focus of our
analysis: we show these discrete densities in Fig. 2, with the same colour choices as
in Fig. 1. It is clear that using densities provides a natural alignment of areas
whose population differs significantly, providing complementary insights with respect
to the absolute number of death counts: greater emphasis is placed on the temporal
structure of the phenomenon. For example, the astonishing behaviour of the province
of Bergamo during the first pandemic wave in 2020 is now much more visible.
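
The alignment effect of working with densities can be seen in a minimal sketch of the formula above. The Poisson-simulated counts and the two hypothetical areas are assumptions for illustration and do not come from the ISTAT data.

```python
import numpy as np

def empirical_density(daily_counts):
    """Return p_t = d_t / sum_t d_t for one area and one calendar year."""
    d = np.asarray(daily_counts, dtype=float)
    total = d.sum()
    if total <= 0:
        raise ValueError("no deaths recorded: the density is undefined")
    return d / total

# Hypothetical counts: a small and a large area sharing the same temporal shape.
rng = np.random.default_rng(0)
small_area = rng.poisson(lam=2.0, size=365)
large_area = 50 * small_area

p_small = empirical_density(small_area)
p_large = empirical_density(large_area)   # identical to p_small after rescaling
```

The two areas differ by a factor of 50 in absolute counts, yet their densities coincide, so only the temporal structure is compared.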


Empirical densities of daily mortality, provinces, 50-69 years
Fig. 2 Empirical densities of daily mortality, for people aged between 50 and 69 years, at the
provincial scale. For each province, the empirical density of the daily mortality is plotted in light
blue: densities are overlaid one on top of the other to visualize their variability. The black solid line
is the weighted mean density, where the weight for each province has been set to be proportional to
its population; some selected provinces are highlighted in colour.


In this talk, we will show results obtained by embedding a smoothed version
of the $\{p_{iy}\}_{iy}$, i.e., an estimate $\{f_{iy}\}_{iy}$ of the continuous density
functions of the $\{T_{iy}\}_{iy}$, in the Hilbert space $B^2(\Theta)$, called Bayes
space [2, 4, 3], where $\Theta$ denotes the calendar year. This is the set (of
equivalence classes) of functions


$$B^2(\Theta) = \{ f : \Theta \to \mathbb{R}^+ \ \text{s.t.} \ f > 0, \ \log(f) \in L^2(\Theta) \}$$


where the equivalence relation in $B^2(\Theta)$ is defined among proportional functions,
i.e., $f =_{B^2} g$ if $f = \alpha g$ for a constant $\alpha > 0$. In [1], we also propose
a preliminary exploration of the $\{p_{iy}\}_{iy}$, based on the Wasserstein space
embedding, a very regular metric space of probability measures with a straightforward
physical interpretation [5]. For the sake of brevity, we here focus on the analysis in
$B^2(\Theta)$, which constitutes our main contribution.
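
The equivalence among proportional functions can be checked numerically. The sketch below uses the centred log-ratio (clr) map, a standard device for working in Bayes spaces that is not spelled out in this text; the uniform daily grid on [0, 1] and the example density shape are assumptions of the sketch.

```python
import numpy as np

def clr(f):
    """Centred log-ratio of a positive density sampled on a uniform grid."""
    log_f = np.log(np.asarray(f, dtype=float))
    return log_f - log_f.mean()

def b2_inner(f, g, dt):
    """Approximate B^2 inner product as the L^2 inner product of the clr images."""
    return float(np.sum(clr(f) * clr(g)) * dt)

def b2_norm(f, dt):
    return float(np.sqrt(b2_inner(f, f, dt)))

# Assumed grid: the calendar year rescaled to [0, 1] with 365 daily steps.
t = (np.arange(365) + 0.5) / 365
dt = 1.0 / 365
f = 1.0 + 0.5 * np.sin(2 * np.pi * t)   # hypothetical positive density shape
uniform = np.ones_like(t)               # the neutral element of the space
```

Here clr(f) and clr(3 * f) coincide, so rescaling a density does not change its representative, and the uniform density has zero norm.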

$B^2(\Theta)$ is equipped with a Hilbert geometry, constituted by appropriate operations
of sum, multiplication by a scalar, and inner product, which make it the
infinite-dimensional counterpart of the Aitchison simplex used in standard compositional
analysis [6, 7]: for this reason this space is considered the most suited Hilbert
embedding for positive continuous density functions. The smoothed densities
$\{f_{iy}\}_{iy}$ are shown in Fig. 3: they are obtained by smoothing the
$\{p_{iy}\}_{iy}$ via compositional splines [8, 9]. It is easy to see, by comparison
with Fig. 2, how smoothing filters out a good amount of noise, much more than in the
case of the elderly class: this is fairly reasonable, since the death process is
usually noisier for younger age classes.


Fig. 3 Smooth estimates of the mortality densities over the 107 Italian provinces. The usual pattern
of mortality is visible till 2019, while the functional process is completely different in 2020, with
the two pandemic waves clearly captured by the estimated densities. The black thick lines represent
the mean density, computed in $B^2$, with weights proportional to the population in each area.


From now on, the $\{f_{iy}\}_{iy}$ are analysed as a spatio-temporal functional random
sample taking values in $B^2(\Theta)$. We briefly anticipate the results of such
analysis:


1. The $\{f_{iy}\}_{iy}$ are decomposed, by means of a linear model formulated in
   $B^2(\Theta)$ [10], in a predictable and an unpredictable part, on the basis of
   mortality during previous years;

2. The unpredictable part is then analysed spatially in order to infer the main
   spatial correlation characteristics of the process; in particular, the impacts of
   the pandemic are investigated via functional variography [13, 14, 11, 12] and
   Principal Component Analysis in the $B^2$ space (SFPCA, [16]);

3. The results obtained at the provincial level are reduced to the municipality scale
   by spatial downscaling [15] techniques, obtaining smooth density estimates
   for each municipality. This provides continuous density at the municipality
   level, without directly smoothing the corresponding daily death process, which
   is quite irregular due to the reduced population of many municipalities. The
   spatial downscaling estimates, that are exclusively based on provincial data, are
   then compared with the actual measurements on municipalities, allowing for the
   identification of local anomalies.


Points 1 and 2 above are detailed in Section 2, while point 3 will be discussed during
the talk. The reader is referred to [1] for full details on the analysis pipeline.


2 Some Results


The first step of the analysis of the random sample $\{f_{iy}\}_{iy}$, where $i$ is
indexing the 107 Italian provinces, is the formulation of a family of
function-on-function linear models in $B^2(\Theta)$, extending classical models
formulated in the $L^2$ case [17], namely


$$f_{iy}(t) = \beta_{0y}(t) + \langle \beta_y(\cdot, t), \bar{f}_{iy} \rangle_{B^2} + \epsilon_{iy}(t), \quad i = 1, \ldots, 107, \ t \in \Theta, \qquad (1)$$


where $\bar{f}_{iy} = \frac{1}{4} \sum_{r=y-4}^{y-1} f_{ir}$ is the $B^2$ mean of the
observed densities in the four years preceding year $y$, the functional parameters
$\beta_{0y}(t)$, $\beta_y(s, t)$ are defined in the $B^2$ sense, as well as the residual
terms $\epsilon_{iy}(t)$ and all operations of summation and multiplication by a
scalar. Model (1) tries to explain the realization of the mortality density
$f_{iy}$ for a year $y$ in a province $i$ as a linear function of what happened in the
same province during the preceding years. It is thus interesting to look at the
following functional prediction errors:

$$\delta_{iy} = f_{iy} - \hat{f}_{iy} \qquad (2)$$

where

$$\hat{f}_{iy}(t) = \hat{\beta}_{0,y-1}(t) + \langle \hat{\beta}_{y-1}(\cdot, t), \bar{f}_{iy} \rangle_{B^2}. \qquad (3)$$

The $\delta_{iy}$ are not the estimates $\hat{\epsilon}_{iy}$ of the residuals of model
(1): they rather represent the error committed in forecasting $f_{iy}$ using the model
fitted at year $y - 1$. Thus, we can look at the densities $\delta_{iy}$ as the
unpredictable component of $f_{iy}$, i.e., as a proxy of what happened at year $y$
which could not be predicted by information available at the previous years, and
analyze them under the spatial viewpoint. For example, we can look at the spatial
heatmaps of the $B^2$ norms of the $\delta_{iy}$, which are shown in Fig. 4. It is
clear, by looking at the magnitude of the error norms, that what happened during 2020
was to a large extent unpredictable, since almost all Italian provinces are
characterized by higher errors with respect to previous years. More significantly, in
2020 a clear spatial pattern can be noticed, at least during the first wave in
northern Italy: a diffusive process, having at its core the provinces most gravely hit
by the first pandemic wave, seems to take place in northern Italy. This pattern is,
reasonably, slightly less evident than in the case of the elderly class analysed in
[1]. Going in this direction, we also show in Fig. 4 the result of a K-means
functional clustering, set in the $B^2$ space, of the $\delta_{iy}$ for the year 2020.
We clearly identify provinces hit by the first wave (blue cluster), while the other two
clusters behave irregularly: this is a clear difference from people aged more than 70
years, where each cluster clearly identifies different kinds of pandemic behaviour
(see [1]).


Prediction error norms and $B^2$ functional clustering, provinces, 50-69 years
Fig. 4 First four panels, from the left: heatmaps of the $B^2$ norm of the prediction errors
$\delta_{iy}$, in logarithmic scale, for the elderly class. In 2020 the pandemic diffusion is
clearly visible in northern Italy, while the prediction errors are generally higher on all
provinces. Last panel: result of a K-means $B^2$ functional clustering (K = 3) on the
$\delta_{iy}$, during 2020.


For a more precise investigation of the spatial correlation structure of the
process across different years, from the $\delta_{iy}$ we compute a functional trace
variogram for each year: we show them for 2017 up to 2020 in Figure 5. Without entering
into the details of the mathematical definition of variograms, we can look at the fitted
curves in Figure 5 as follows. Distances are on the x-axis, while on the y-axis we
have a function of the spatial correlation of the process: when the curve reaches its
horizontal asymptote, it has reached the total variance of the process and we are
beyond the maximum correlation length. In this perspective, it is immediate to infer
that not only has the total variance of the functional process $\delta_{iy}$ sharply
increased in 2020, but also a significant spatial correlation has manifested,
compatibly with the presence of a pandemic. In the main work [1], we further deepen the
connection between the pandemic and the upheavals in the spatial structure by means of
Principal Component Analysis of the $\delta_{iy}$ in the Bayes space (SFPCA, [16]).


Fig. 5 Empirical trace-semivariograms for the prediction errors $\delta_{iy}$, in people aged
between 50 and 69 years. The purple lines are the corresponding fitted exponential models.
Distances on the x-axes are expressed in kilometers. The last panel shows the 2020 severe
perturbation of the spatial dependence structure of the process generating the prediction errors.
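
To give a concrete idea of what an empirical trace-semivariogram measures, a minimal sketch is given below. It assumes that each prediction error is already represented by its centred log-ratio values on a common grid, so that $B^2$ distances become ordinary $L^2$ distances, and it uses simple distance bins; the function name, the binning and the toy data are hypothetical and do not reproduce the estimator or the fitted exponential models of [1].

```python
import numpy as np

def trace_semivariogram(coords, clr_curves, dt, bin_edges):
    """Average of half the squared functional distances, grouped by spatial distance."""
    coords = np.asarray(coords, dtype=float)
    curves = np.asarray(clr_curves, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)            # all unordered pairs of areas
    spatial_dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_func_dist = np.sum((curves[i] - curves[j]) ** 2, axis=1) * dt
    gamma = np.full(len(bin_edges) - 1, np.nan)         # NaN marks empty bins
    for b in range(len(bin_edges) - 1):
        in_bin = (spatial_dist >= bin_edges[b]) & (spatial_dist < bin_edges[b + 1])
        if in_bin.any():
            gamma[b] = 0.5 * sq_func_dist[in_bin].mean()
    return gamma

# Toy example: three areas on a line, curves sampled at two grid points.
coords = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
clr_curves = [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0]]
gamma = trace_semivariogram(coords, clr_curves, dt=1.0, bin_edges=[0.0, 1.5, 3.5])
```

Values that grow with distance and then level off correspond to the behaviour described for the fitted curves: the plateau approximates the total variance of the process.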
Clustering Validation in the Context of
Hierarchical Cluster Analysis:
an Empirical Study


Osvaldo Silva, Áurea Sousa, and Helena Bacelar-Nicolau 


Abstract The evaluation of clustering structures is a crucial step in cluster analysis.
This study presents the main results of the hierarchical cluster analysis of variables
concerning a real dataset in the context of Higher Education. The goal of this
research is to find a typology of some relevant items taking into account both the
homogeneity and the isolation of the clusters. Two similarity measures, namely the
standard affinity coefficient and Spearman's correlation coefficient, were used, and
combined with three probabilistic (AVL, AVB and AV1) aggregation criteria, from
a parametric family in the scope of the VL (Validity Link) methodology. The best
partitions were selected based on some validation indices, namely the global STAT
levels statistics and the measures $P(I2, \Sigma)$ and $\gamma$, adapted to the case of
similarity coefficients. In order to evaluate the clusters and identify their most
representative elements, the Mann and Whitney U statistics and the silhouette plot were
also used.


1 Introduction


Cluster analysis or unsupervised classification usually concerns exploratory multi- 
variate data analysis methods and techniques for grouping either a set of data units 
or an associated set of descriptive variables in such a way that elements in the same 
group (cluster) are more similar to each other than elements in different clusters [6]. 
Therefore, it is important to validate the results obtained, bearing in mind that, in 
an ideal situation, the clusters should be internally homogeneous and externally well 
separated or isolated. Thus, according to Silva et al. ([15], p. 136), there are some 
important questions, such as: “i) How to compare partitions obtained using different 
cluster algorithms? ii) Is it possible to join information from several approaches in 
the decision-making process of choosing the most representative partition?" 

This paper presents the main results of a hierarchical cluster analysis of variables
concerning a real dataset in the field of Higher Education, in order to find a typology
taking into account relevant validation measures. Two similarity measures (standard
affinity coefficient and Spearman's correlation coefficient) were used, and combined
with a parametric family of aggregation criteria in the scope of the VL methodology
(e.g., [10, 11, 17]).

With regard to the validation of clustering structures, some validation indices 
were used for the evaluation of partitions and the clusters that integrate them, which 
are referred to in Section 2. The main results are presented and discussed in Section 
3. Section 4 contains some final remarks. 


2 Data and Methods 


Data were obtained from a questionnaire administered to three hundred and fifty
students who were attending Higher Education in a public university, after they
gave their informed consent. The questionnaire contains, among others, eleven
questions related to academic life and the respective courses.

Several algorithms of hierarchical cluster analysis of variables were applied
on the data matrix. The variables (items) are: T1-Participation, T2-Interest,
T3-Expectations, T4-Accomplishment, T5-Job Outlook, T6-Teachers' Professional
Competence, T7-Distribution of Curricular Units, T8-Number of weekly hours
of lessons, T9-Number of hours of daily study, T10-School Outcomes and
T11-Assessment Methods, which were evaluated based on a Likert scale from 1 to 5
(1-Totally disagree, 2-Partially disagree, 3-Neither disagree nor agree, 4-Partially
agree, 5-Totally agree).

The Ascendant Hierarchical Cluster Analysis (AHCA) was based on the standard
affinity coefficient [1, 17] and Spearman's correlation coefficient. In this paper both
measures of comparison were combined with three probabilistic aggregation criteria
(AVL, AVB and AV1), issued from the VL parametric family. This methodology, in the
scope of Cluster Analysis, uses probabilistic comparison functions, between pairs of
elements, which correspond to random variables following a unit uniform
distribution. Besides, this approach considers probabilistic aggregation criteria,
which can be interpreted as distribution functions of statistics of independent random
variables, that are i.i.d. uniform on [0, 1] (e.g., [17]).
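
For readers less familiar with the second comparison measure, the sketch below computes Spearman's correlation coefficient between two Likert items as the Pearson correlation of their average ranks, which handles the many ties typical of a 1 to 5 scale. The item responses are hypothetical, and the standard affinity coefficient is not reproduced here.

```python
import numpy as np

def average_ranks(values):
    """Rank the values from 1 to n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    start = 0
    while start < len(values):
        end = start
        while end + 1 < len(values) and sorted_vals[end + 1] == sorted_vals[start]:
            end += 1
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson correlation computed on average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical answers of eight students to two items (Likert scale 1-5).
item_a = [5, 4, 4, 3, 2, 5, 1, 3]
item_b = [4, 4, 5, 3, 1, 5, 2, 2]
rho = spearman(item_a, item_b)
```

A matrix of such coefficients over all pairs of items is the kind of proximity matrix on which the hierarchical algorithms operate.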

Let A and B be two clusters with cardinals $\alpha$ and $\beta$, respectively, and
let $\gamma_{xy}$ be a similarity measure between pairs of elements $x, y \in E$
(the set of elements to classify). For the VL family of methods (e.g., SL, AV1, AVB,
and AVL), the comparison functions between clusters can be summarized by the
following joint formula:

\[ \Gamma(A, B) = (p_{AB})^{g(\alpha, \beta)} \qquad (1) \]

where $\alpha = \mathrm{Card}\,A$, $\beta = \mathrm{Card}\,B$,
$p_{AB} = \max[\gamma_{ab} : (a, b) \in A \times B]$, and
$1 \le g(\alpha, \beta) \le \alpha\beta$, establishing a bridge between the SL and
AVL methods, which have a braking effect on the formation of chains. For example,
$g(\alpha, \beta) = 1$ for SL, $g(\alpha, \beta) = (\alpha + \beta)/2$ for AV1,
$g(\alpha, \beta) = \sqrt{\alpha\beta}$ for AVB, and $g(\alpha, \beta) = \alpha\beta$
for AVL (see [3, 17]).
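
As an illustration of formula (1), the following minimal Python sketch evaluates the comparison function for the four criteria listed above. The function name `vl_comparison` and the small similarity matrix are hypothetical and serve only as an example; the sketch assumes that the similarities are already probabilistic values in [0, 1].

```python
import math

def vl_comparison(sim, cluster_a, cluster_b, criterion="AVL"):
    """Return Gamma(A, B) = p_AB ** g(alpha, beta) for one VL-family criterion.

    sim[a][b] is assumed to hold a probabilistic similarity in [0, 1].
    """
    alpha, beta = len(cluster_a), len(cluster_b)
    # p_AB: largest similarity over all pairs (a, b) in A x B
    p_ab = max(sim[a][b] for a in cluster_a for b in cluster_b)
    # Exponent g(alpha, beta) for each criterion, as listed in the text
    exponent = {
        "SL": 1.0,
        "AV1": (alpha + beta) / 2.0,
        "AVB": math.sqrt(alpha * beta),
        "AVL": float(alpha * beta),
    }[criterion]
    return p_ab ** exponent

# Hypothetical 4 x 4 similarity matrix, for illustration only
sim = [
    [1.0, 0.7, 0.9, 0.2],
    [0.7, 1.0, 0.4, 0.3],
    [0.9, 0.4, 1.0, 0.6],
    [0.2, 0.3, 0.6, 1.0],
]
values = {c: vl_comparison(sim, [0, 1], [2, 3], c)
          for c in ("SL", "AV1", "AVB", "AVL")}
```

Because $p_{AB}$ lies in [0, 1], a larger exponent yields a smaller value for pairs of large clusters, which is one way to read the braking effect on chaining mentioned above.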

The application of the two measures of comparison between elements (Spearman
correlation coefficient and standard affinity coefficient), combined with the
aforementioned aggregation criteria, aims to find a typology of items corresponding to
the best partition among the best partitions obtained by the several algorithms, in
order to verify if there are any substantial changes in the results. Therefore, some
validation indices based on the values of the corresponding proximity matrices were
used, namely the global levels statistics (STAT) [1, 10, 11] and the indices
P(I2mod, $\Sigma$) and $\gamma$ [8], adapted to this type of matrices [16], so that
the choice of the best partition is judicious and based on the desirable properties
(e.g., isolation and homogeneity of the clusters). Concerning the best partitions, the
respective clusters and the identification of their most representative elements were
based on appropriate adaptations of the Mann and Whitney U statistics [8] and of the
silhouette plots [14] to the case of similarity measures.

Each level of a dendrogram corresponds to a stage in the constitution of the 
partitions hierarchy. Therefore, the study of the most relevant partition(s) is strictly 
related to the choice of the best cut-off levels (e.g., [6, 5]).

According to Bacelar-Nicolau [1, 2], the global levels statistics (STAT) values
must be calculated for each of the levels k = 1, ..., nivmax of the corresponding
dendrograms, and are denoted by STAT(k). At each level k, STAT(k) is the global
statistic that measures the total information given by the pre-order associated with
the corresponding partition, in relation to the initial pre-order associated with the
similarity or dissimilarity measure. A "significant" level is one that corresponds to
a partition for which the global statistic undergoes a significant increase in relation
to the information provided by neighbouring levels, that is, a local maximum of the
differences DIF(k) = STAT(k) - STAT(k - 1), k = 1, ..., nivmax.


2.1 Adaptation of the P(I2, $\Sigma$) Index


To evaluate the partitions, an appropriate adaptation of the index P(I2, $\Sigma$) [8]
to the case of similarity measures was used, given by the following formula:

\[ P(I2mod, \Sigma) = \frac{1}{c} \sum_{r=1}^{c} \frac{\sum_{i \in C_r} \sum_{j \notin C_r} s_{ij}}{n_r (n - n_r)} \qquad (2) \]

where c is the number of clusters of the partition, $n_r$ is the number of elements
of cluster $C_r$, n is the total number of elements, and $s_{ij}$ is the value of the
similarity measure between the element i belonging to cluster $C_r$ and the element j
belonging to another cluster. This index takes into account the number of clusters and
the number of elements in each of the clusters and evaluates the isolation of clusters
belonging to a given partition.


2.2 Goodman and Kruskal Index ($\gamma$)


The $\gamma$ index, proposed by Goodman and Kruskal [7], has been widely used in
cluster validation [9]. Comparisons are made between all within-cluster similarities
$s_{ij}$ and all between-cluster similarities $s_{kl}$ [18]. A comparison is judged
concordant (respectively, discordant) if $s_{ij}$ is strictly greater (respectively,
smaller) than $s_{kl}$. The $\gamma$ index is defined by:

\[ \gamma = (S_{+} - S_{-}) / (S_{+} + S_{-}), \qquad (3) \]

where $S_{+}$ (respectively, $S_{-}$) is the number of concordant (respectively,
discordant) comparisons. This index is a global stopping rule and it evaluates the fit
of the partition into c clusters based on the homogeneity (high similarity between the
elements within the clusters) and the isolation (low similarity of the elements
between the clusters) of the clusters. Note that the higher the value of this index,
the better the fit of that partition.

The use of the STAT, $\gamma$ and P(I2mod, $\Sigma$) indices can help to identify the
most significant levels of a dendrogram, taking into account both the homogeneity and
the isolation of the clusters [15].


2.3 U Statistics (Mann and Whitney) 


U statistics [12] are relevant for assessing the suitability of a cluster, combining the
concepts of compactness and isolation. Thus, the "best" cluster is the one with the
lowest values of the global U-index, $U_G$, and the local U-index, $U_L$ [8]. In the
present paper we used an appropriate adaptation of these indices to the case of
similarity measures (for details, see [19]). Moreover, the clusters considered "ideal"
are those for which $U_G$ and $U_L$ both take the value zero. Mann and Whitney's U
statistics are useful in decision making, in situations of uncertainty, for the
evaluation of both the clusters and the partitions.


2.4 Silhouette Plots 


We also used an appropriate adaptation of the silhouette plots [14], which allows
the assessment of compactness and relative isolation of clusters. The adaptation of
this measure to the case of similarity measures, Sil(i), considers the average of the
similarities between an element i belonging to cluster $C_r$, which contains
$n_r$ ($\ge 2$) elements, and all other elements that do not belong to this cluster
(see [19]). The values of this measure, {Sil(i) : i $\in C_r$}, lie between -1 and +1,
with "values near +1 indicating that element strongly belongs to the cluster in which
it has been placed" ([8], p. 205). In the case of a singleton cluster, Sil(i) assumes
the value zero [8] in the corresponding algorithm.


3 Results and Discussion 


The best partitions provided by the dendrograms are shown in Table 1. 


Table 1 The best partitions concerning the dendrograms.

Coefficient  Method   The best partition                                    Validation indices
-----------  -------  ----------------------------------------------------  -----------------------------
Affinity     AVL      (T1, T3, T4, T5, T6, T7, T8, T10, T11), (T2, T9)      STAT = 5.1301
                                                                            $\gamma$ = 0.8589
                                                                            P(I2mod, $\Sigma$) = 0.2077

Affinity     AV1/AVB  (T1, T3, T4, T5, T6, T7, T8, T10, T11), (T2), (T9)    STAT = 5.3453
                                                                            $\gamma$ = 0.8830
                                                                            P(I2mod, $\Sigma$) = 0.2049

Spearman     AVL      (T3, T4, T2, T9), (T7, T11, T8), (T6, T10), (T1),     STAT = 4.0152
                      (T5)                                                  $\gamma$ = 0.8178
                                                                            P(I2mod, $\Sigma$) = 0.3896

Spearman     AV1/AVB  (T3, T4, T2, T9, T6), (T7, T11, T8), (T1, T10),       STAT = 4.05751
                      (T5)                                                  $\gamma$ = 0.7317
                                                                            P(I2mod, $\Sigma$) = 0.38177


Figure 1 shows the dendrograms obtained with the standard affinity coefficient
(left side) and Spearman's correlation coefficient (right side), both combined with
the AVL method.

Fig. 1 Dendrograms based on the standard affinity coefficient (left side) and Spearman's
correlation coefficient (right side) - AVL.


The "best" partition obtained using the affinity coefficient and the AVL method is
the partition into two clusters (level 9 of the aggregation process). The first cluster
consists of nine items that highlight the importance of the teachers' professional
competence, the structure and content of the course, and the future perspectives
regarding career opportunities, mostly factors exogenous to the students. The
second one is composed of two items (T2 and T9), which emphasize the role of
interest in the study of Mathematics.

The algorithms in which the standard affinity coefficient was used are the ones that
provided the best partitions, and their hierarchies are the ones that remained closest
to the initial pre-orders. In fact, in the case of the Spearman correlation coefficient,
the values of the STAT and $\gamma$ indices are clearly lower than those obtained with
the affinity coefficient. Moreover, the cluster {T1, T3, T4, T5, T6, T7, T8, T10, T11},
corresponding to the best partition provided by the combination of the standard
affinity coefficient with the aggregation criteria AVL, AV1 and AVB, presents
$U_G = 39$ and $U_L = 4$, both lower than those obtained for the cluster
{T3, T4, T2, T9, T6} ($U_G = 65$ and $U_L = 26$) provided by the Spearman correlation
coefficient combined with the AV1 and AVB methods.

Focusing on the first two partitions of Table 1, the only difference between them is
that the best partition provided by the AV1 and AVB methods contains the singletons
T2 and T9, whereas the best partition given by AVL joins these two elements in the
same cluster. The values of the numerical validation indices shown in Table 1 indicate
that the best partition is the one provided by the AV1 and AVB methods. This
conclusion is reinforced by the silhouette plot (see Figure 2), which indicates that the
cluster joining T2 and T9, given by the AVL method, includes the elements with the
two lowest values of Sil, and that Sil(T2) is negative (i.e., T2 does not fit very well
in this cluster). Note that the silhouette plot cannot be used for the best partition,
since it does not apply to singletons.

Fig. 2 Silhouette plot - standard affinity coefficient and AVL method.


4 Final Remarks 


This research was useful for identifying relevant partitions of items in the context
of Higher Education. In the cases where the affinity and the Spearman correlation
coefficients were used, it was concluded that the probabilistic criteria AV1 and AVB
showed a higher agreement regarding the hierarchies of partitions obtained than the
AVL method.

The validation measures STAT, $\gamma$ and P(I2mod, $\Sigma$) help us to determine the
best cut-off levels of a hierarchy of clusters, taking into account both the
homogeneity and the isolation of the clusters. It should also be noted that if there is
no absolute consensus between these three measures, the Mann and Whitney U statistics
and the silhouette plot prove to be very useful, as we have seen with the application
of this methodology to evaluate both the clusters and the partitions obtained.


An MML Embedded Approach for Estimating the Number of Clusters


Cláudia Silvestre, Margarida G. M. S. Cardoso, and Mário Figueiredo 


Abstract Assuming that the data originate from a finite mixture of multinomial
distributions, we study the performance of an integrated Expectation Maximization
(EM) algorithm that uses the Minimum Message Length (MML) criterion to select
the number of mixture components. This EM-MML approach, rather than selecting
one among a set of pre-estimated candidate models (which requires running EM
several times), seamlessly integrates estimation and model selection in a single
algorithm. Comparisons are provided with EM combined with well-known information
criteria, e.g., the Bayesian Information Criterion. We resort to synthetic data
examples and a real application. The EM-MML computation time is a clear advantage
of this method; also, the real data solution it provides is more parsimonious,
which reduces the risk of model order overestimation and improves interpretability.


Keywords: finite mixture model, EM algorithm, model selection, minimum message
length, categorical data


1 Introduction 


Clustering is a technique commonly used in several research and application areas.
Most clustering techniques focus on numerical data. In fact, clustering methods for
categorical data are more challenging [12] and there are fewer techniques
available [11].

In order to determine the number of clusters, model-based approaches commonly
resort to information-based criteria, e.g., the Bayesian Information Criterion (BIC)
[15] or the Akaike Information Criterion (AIC) [1]. These criteria look for a balance
between the model's fit to the data (which corresponds to maximizing the likelihood
function) and parsimony (using penalties associated with measures of model
complexity), thus trying to avoid over-fitting. The use of information criteria follows
the estimation of candidate finite mixture models for which a predetermined number
of clusters is indicated, generally resorting to an EM (Expectation Maximization)
algorithm [7]. In this work, we focus on determining the number of clusters while
clustering categorical data, using an EM embedded approach to estimate the number
of clusters. This approach does not rely on selecting among a set of pre-estimated
candidate models, but rather integrates estimation and model selection in a single
algorithm. Our new implementation, which deals with categorical variables by
estimating a finite mixture of multinomials, follows a previous version described in
[16]. We capitalized on the work of Figueiredo and Jain [9] for clustering continuous
data and extended it to deal with categorical data. The embedded method is thus based
on a Minimum Message Length (MML) criterion to select the number of clusters and
on an EM algorithm to estimate the model parameters.


2 Clustering with Finite Mixture Models 


The literature on finite mixture models and their application is vast, including some 
books covering theory, geometry, and applications [8, 13, 3]. When applying finite 
mixture models to social sciences, the analyst is often confronted with the need to 
uncover sub-populations based on qualitative indicators. 


2.1 Definitions and Concepts 


Let $\mathcal{Y} = \{y_i, i = 1, \ldots, n\}$ be a set of n independent and identically
distributed (i.i.d.) sample observations of a random vector $Y = [Y_1, \ldots, Y_L]'$.
We assume Y follows a mixture of K component densities, $f(y|\theta_k)$
($k = 1, \ldots, K$), with probabilities $\{\alpha_1, \ldots, \alpha_K\}$, where
$\theta_k$ are the distributional parameters defining the k-th component and
$\Theta = \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$ is the set of all
the parameters of the model. The $\alpha$ values, also called mixing probabilities, are
subject to the usual constraints: $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \ge 0$,
$k = 1, \ldots, K$. The log-likelihood of the observed set of sample observations is

\[ \log f(\mathcal{Y}|\Theta) = \log \prod_{i=1}^{n} f(y_i|\Theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k f(y_i|\theta_k). \qquad (1) \]

In clustering, the identity of the component that generated each sample observation
is unknown. The observed data $\mathcal{Y}$ is therefore regarded as incomplete, where
the missing data is a set of indicator variables $\mathcal{Z} = \{z_1, \ldots, z_n\}$,
each taking the form $z_i = [z_{i1}, \ldots, z_{iK}]'$, where $z_{ik}$ is a binary
indicator: $z_{ik}$ takes the value 1 if the observation $y_i$ was generated by the
k-th component, and 0 otherwise. It is usually assumed that the
$\{z_i, i = 1, \ldots, n\}$ are i.i.d., following a multinomial distribution of
K categories, with probabilities $\{\alpha_1, \ldots, \alpha_K\}$. The log-likelihood
of the complete data $\{\mathcal{Y}, \mathcal{Z}\}$ is given by

\[ \log f(\mathcal{Y}, \mathcal{Z}|\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log\left[\alpha_k f(y_i|\theta_k)\right]. \qquad (2) \]


2.2 Discrete Finite Mixture Models 


Consider that each variable $Y_l$ ($l = 1, \ldots, L$) in Y can take one of $C_l$
categories. Conditionally on having been generated by the k-th component of the
mixture, each $Y_l$ is thus modeled by a multinomial distribution with $n_l$ trials,
$C_l$ categories, and non-negative parameters
$\theta_{kl} = \{\theta_{klc}, c = 1, \ldots, C_l\}$, with
$\sum_{c=1}^{C_l} \theta_{klc} = 1$. For a sample $y_{il}$ ($i = 1, \ldots, n$) of
$Y_l$, we denote as $y_{ilc}$ the number of outcomes in category c, which is a
sufficient statistic; naturally, $\sum_{c=1}^{C_l} y_{ilc} = n_l$. Thus, with
$\theta_k = \{\theta_{k1}, \ldots, \theta_{kL}\}$ and
$\Theta = \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$, the
log-likelihood function for a set of observations corresponding to a discrete finite
mixture model (mixture of multinomials) follows from (1), with each component density
given by a product of multinomial probability functions. This log-likelihood can be
seen as corresponding to a missing-data problem, where the missing data has exactly
the same meaning and structure as above. The log-likelihood of the complete data
$\{\mathcal{Y}, \mathcal{Z}\}$ is thus given by

\[ \log p(\mathcal{Y}, \mathcal{Z}|\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log\left[\alpha_k \prod_{l=1}^{L} n_l! \prod_{c=1}^{C_l} \frac{(\theta_{klc})^{y_{ilc}}}{y_{ilc}!}\right]. \qquad (3) \]


To obtain a maximum-likelihood (ML) or maximum a posteriori (MAP) estimate
of the parameters of a multinomial mixture, the well-known EM algorithm is usually 
the tool of choice [7]. 
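
To make the likelihood of a mixture of multinomials more concrete, the following Python sketch evaluates the observed-data log-likelihood for count data. It is a minimal illustration under stated assumptions, not the authors' implementation: the data layout (`Y[i][l][c]` holding the count of category c of variable l for observation i), the function names and the toy values are all hypothetical, and the variables are assumed conditionally independent within each component.

```python
import math

def log_multinomial(y_l, theta_l):
    """log of n_l! * prod_c theta_c**y_c / y_c! for one variable."""
    out = math.lgamma(sum(y_l) + 1)          # log n_l!
    for y, th in zip(y_l, theta_l):
        out -= math.lgamma(y + 1)            # - log y_c!
        if y > 0:
            if th <= 0.0:
                return float("-inf")         # impossible outcome under theta
            out += y * math.log(th)
    return out

def log_component(y_i, theta_k):
    """log f(y_i | theta_k): sum over variables of multinomial log-probabilities."""
    return sum(log_multinomial(y_l, th_l) for y_l, th_l in zip(y_i, theta_k))

def mixture_loglik(Y, alpha, theta):
    """Observed-data log-likelihood: sum_i log sum_k alpha_k f(y_i | theta_k)."""
    total = 0.0
    for y_i in Y:
        terms = [math.log(a) + log_component(y_i, th_k)
                 for a, th_k in zip(alpha, theta) if a > 0]
        m = max(terms)
        if m == float("-inf"):
            return float("-inf")
        # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

# Toy example: one binary variable, one trial per observation, two components
Y = [[[1, 0]], [[0, 1]]]
alpha = [0.5, 0.5]
theta = [[[0.8, 0.2]], [[0.3, 0.7]]]
loglik = mixture_loglik(Y, alpha, theta)
```

In the toy example the two observations have mixture probabilities 0.55 and 0.45, so the log-likelihood equals log(0.55 × 0.45).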


3 Model Selection for Categorical Data 


Model selection is an important problem in statistical analysis [6]. In model-based 
clustering, the term model selection usually refers to the problem of determining 
the number of clusters, although it may also refer to the problem of selecting the 
structure of the clusters. Model-based clustering provides a statistical framework to 
solve this problem, usually resorting to information criteria. Among the best-known
information criteria we find BIC and AIC, their modifications - namely the consistent
AIC (CAIC) and the Modified AIC (MAIC) - and also the Integrated Completed
Likelihood (ICL) [14, 4]. They are all easily implemented, the final model being
selected according to a compromise between its fit to data and its complexity.
In this work, we use the Minimum Message Length (MML) criterion to choose 
the number of components of a mixture of multinomials. MML is based on the 
information-theoretic view of estimation and model selection, according to which an 
adequate model is one that allows a short description of the observations. MML-type 
criteria evaluate statistical models according to their ability to compress a message 
containing the data, looking for a balance between choosing a simple model and 
one that describes the data well. According to Shannon's information theory, if Y is
some random variable with probability distribution $p(y|\Theta)$, the optimal
code-length (in an expected value sense) for an outcome y is
$l(y|\Theta) = -\log_2 p(y|\Theta)$, measured in bits (from the base-2 logarithm). If
$\Theta$ is unknown, the total code-length function has two parts:
$l(y, \Theta) = l(y|\Theta) + l(\Theta)$; the first part encodes the outcome y, while
the second part encodes the parameters of the model. The first part corresponds to the
fit of the model to the data (better fit corresponds to higher compression), while the
second part represents the complexity of the model. The message length function
for a mixture of distributions (as developed in [2]) is:

\[ l(y, \Theta) = -\log p(\Theta) - \log p(y|\Theta) + \frac{1}{2} \log|I(\Theta)| + \frac{C}{2}\left(1 - \log(12)\right), \qquad (4) \]

where $p(\Theta)$ is a prior distribution over the parameters, $p(y|\Theta)$ is the
likelihood function of the mixture,
$|I(\Theta)| = \left|-E\left[\frac{\partial^2}{\partial\Theta^2} \log p(Y|\Theta)\right]\right|$
is the determinant of the expected Fisher information matrix, and C is the number of
parameters of the model that need to be estimated. For example, for the mixture of K
multinomial distributions presented in (3),
$C = (K - 1) + K \sum_{l=1}^{L} (C_l - 1)$. The expected Fisher information matrix of a
mixture leads to a complex analytical form of MML which cannot be easily computed. To
overcome this difficulty, Figueiredo and Jain [9] replace the expected Fisher
information matrix by its complete-data counterpart
$I_c(\Theta) = -E\left[\frac{\partial^2}{\partial\Theta^2} \log p(Y, Z|\Theta)\right]$.
Also, they adopt independent Jeffreys' priors for the mixture parameters, which are
proportional to the square root of the determinant of the Fisher information matrix.
The resulting message length function is

\[ l(y, \Theta) = \frac{M}{2} \sum_{k:\, \alpha_k > 0} \log\left(\frac{n \alpha_k}{12}\right) + \frac{k_{nz}}{2} \log\frac{n}{12} + \frac{k_{nz}(M + 1)}{2} - \log p(y|\Theta), \qquad (5) \]

where M is the number of parameters specifying each component (the dimension
of each $\theta_k$) and $k_{nz}$ the number of components with non-zero probability
(for more details on the derivation of (5), see [9, 2]).


4 The MML Based EM Algorithm


In order to estimate a mixture of multinomials, we use a variant of the EM algorithm
(herein termed EM-MML), which integrates both estimation and model selection, by
directly minimizing (5). The algorithm results from observing that (5) contains, in
addition to the log-likelihood term, an explicit penalty on the number of components
(the two terms proportional to $k_{nz}$), and a term (the first one) that can be seen
as a log-prior on the $\alpha_k$ parameters of $\Theta$, which will directly affect
the M-step.
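
The structure of the message length (5) can be illustrated with a short Python sketch. It assumes natural logarithms and reads (5) as the sum of a term in the mixing probabilities, two terms proportional to the number of non-zero components, and the negative log-likelihood; the function names `n_free_parameters` and `message_length` and the numerical values are hypothetical and used only for illustration.

```python
import math

def n_free_parameters(K, categories):
    """C = (K - 1) + K * sum_l (C_l - 1) for a mixture of K multinomials."""
    return (K - 1) + K * sum(c - 1 for c in categories)

def message_length(alpha, M, n, loglik):
    """Message length of a mixture, given mixing probabilities and log-likelihood.

    alpha  : mixing probabilities (zeros mark excluded components)
    M      : number of parameters per component
    n      : sample size
    loglik : log p(y | Theta), natural logarithm assumed
    """
    nonzero = [a for a in alpha if a > 0]
    k_nz = len(nonzero)
    prior_term = (M / 2.0) * sum(math.log(n * a / 12.0) for a in nonzero)
    size_penalty = (k_nz / 2.0) * math.log(n / 12.0) + k_nz * (M + 1) / 2.0
    return prior_term + size_penalty - loglik

# Three categorical variables with 2, 3 and 4 levels and K = 2 components
C = n_free_parameters(2, [2, 3, 4])
# Hypothetical fitted values, for illustration only
length = message_length([1.0, 0.0], M=6, n=12, loglik=-10.0)
```

With three variables of 2, 3 and 4 levels, each component has M = 1 + 2 + 3 = 6 free parameters, and a two-component model has C = 1 + 2 × 6 = 13. Components whose mixing probability is zero do not enter the sum, which is how the criterion rewards smaller models.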


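E-step: The E-step of the EM-MML is precisely the same as in the case of ML or
MAP estimation, since the generative model for the data is the same. Since we are
dealing with a multinomial mixture, we simply have to plug in the corresponding
multinomial probability function, yielding

\[ \bar{z}_{ik}^{(t)} = \frac{\hat{\alpha}_k^{(t)} \prod_{l=1}^{L} n_l! \prod_{c=1}^{C_l} \frac{(\hat{\theta}_{klc}^{(t)})^{y_{ilc}}}{y_{ilc}!}}{\sum_{j=1}^{K} \hat{\alpha}_j^{(t)} \prod_{l=1}^{L} n_l! \prod_{c=1}^{C_l} \frac{(\hat{\theta}_{jlc}^{(t)})^{y_{ilc}}}{y_{ilc}!}} \qquad (6) \]

for $i = 1, \ldots, n$ and $k = 1, \ldots, K$.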


M-step: For the M-step, noticing that the first term in (5) can be seen as the
negative log-prior
$-\log p(\alpha_1, \ldots, \alpha_K) = \frac{C - K + 1}{2K} \sum_{k=1}^{K} \log \alpha_k$
(plus a constant), where $(C - K + 1)/K = M$, and enforcing the conditions that
$\alpha_k \ge 0$, for $k = 1, \ldots, K$, and that $\sum_{k=1}^{K} \alpha_k = 1$,
yields the following updates for the estimates of the $\alpha_k$ parameters:

\[ \hat{\alpha}_k^{(t+1)} = \frac{\max\left\{0, \sum_{i=1}^{n} \bar{z}_{ik}^{(t)} - \frac{C - K + 1}{2K}\right\}}{\sum_{j=1}^{K} \max\left\{0, \sum_{i=1}^{n} \bar{z}_{ij}^{(t)} - \frac{C - K + 1}{2K}\right\}} \qquad (7) \]

for $k = 1, \ldots, K$. Notice that some $\hat{\alpha}_k^{(t+1)}$ may be zero; in
that case, the k-th component is excluded from the mixture model. The multinomial
parameters corresponding to components with $\hat{\alpha}_k^{(t+1)} = 0$ need not be
further calculated, since these components do not contribute to the likelihood. For
the components with non-zero probability, $\hat{\alpha}_k^{(t+1)} > 0$, the estimates
of the multinomial parameters are updated to their standard weighted ML estimates:

\[ \hat{\theta}_{klc}^{(t+1)} = \frac{\sum_{i=1}^{n} \bar{z}_{ik}^{(t)} y_{ilc}}{n_l \sum_{i=1}^{n} \bar{z}_{ik}^{(t)}} \qquad (8) \]

for $k = 1, \ldots, K$, $l = 1, \ldots, L$, and $c = 1, \ldots, C_l$. Notice that, in
accordance with the meaning of the $\theta_{klc}$ parameters,
$\sum_{c=1}^{C_l} \hat{\theta}_{klc}^{(t+1)} = 1$.
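
The two steps above can be sketched in Python as one E-step followed by one M-step. This is a simplified illustration under stated assumptions rather than the authors' implementation: the data layout (`Y[i][l][c]` for the count of category c of variable l in observation i), the function names and the toy values are hypothetical, every observation is assumed to have positive probability under at least one component, and the number of trials of each variable is assumed equal across observations.

```python
import math

def log_component(y_i, theta_k):
    """log of prod_l n_l! prod_c theta_klc**y_ilc / y_ilc! for one observation."""
    out = 0.0
    for y_l, th_l in zip(y_i, theta_k):
        out += math.lgamma(sum(y_l) + 1)
        for y, th in zip(y_l, th_l):
            out -= math.lgamma(y + 1)
            if y > 0:
                if th <= 0.0:
                    return float("-inf")
                out += y * math.log(th)
    return out

def e_step(Y, alpha, theta):
    """Posterior component probabilities; excluded components get weight 0."""
    Z = []
    for y_i in Y:
        logw = [math.log(a) + log_component(y_i, th) if a > 0 else float("-inf")
                for a, th in zip(alpha, theta)]
        m = max(logw)
        w = [math.exp(v - m) for v in logw]
        s = sum(w)
        Z.append([v / s for v in w])
    return Z

def m_step(Y, Z, alpha, theta, C):
    """Penalised update of alpha, then weighted ML update of theta."""
    K, n = len(alpha), len(Y)
    penalty = (C - K + 1) / (2.0 * K)        # equals M / 2
    mass = [sum(Z[i][k] for i in range(n)) for k in range(K)]
    num = [max(0.0, m - penalty) for m in mass]
    total = sum(num)
    if total == 0.0:
        raise ValueError("all components were excluded")
    new_alpha = [v / total for v in num]
    new_theta = []
    for k in range(K):
        if new_alpha[k] == 0.0:
            new_theta.append(theta[k])       # excluded: parameters left untouched
            continue
        comp = []
        for l in range(len(Y[0])):
            n_l = sum(Y[0][l])               # trials of variable l
            comp.append([sum(Z[i][k] * Y[i][l][c] for i in range(n)) / (n_l * mass[k])
                         for c in range(len(Y[0][l]))])
        new_theta.append(comp)
    return new_alpha, new_theta

# Toy data: one binary variable, one trial per observation, K = 2, so C = 3
Y = [[[1, 0]], [[1, 0]], [[0, 1]], [[0, 1]]]
alpha, theta = [0.5, 0.5], [[[0.9, 0.1]], [[0.1, 0.9]]]
Z = e_step(Y, alpha, theta)
new_alpha, new_theta = m_step(Y, Z, alpha, theta, C=3)
```

In practice the two steps would be repeated until the message length stops decreasing; the stopping rule and the order in which components are updated are not specified here and are left out of the sketch.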


5 Data Analysis and Results 


First, we evaluate the performance of the EM-MML algorithm on 10 synthetic data
sets, over 50 runs. The data sets were generated from a mixture with 2 components
and 3 categorical variables (with 2, 3 and 4 levels). The corresponding Silhouette
index values illustrate the diversity of the structures: 0.099; 0.216; 0.217; 0.230;
0.713; 0.733; 0.746; 0.778; 0.805; 0.817. The results are compared with those
obtained from a standard EM algorithm combined with the BIC, AIC, CAIC, MAIC,
and ICL criteria.

The comparison resorts to a cohesion-separation measure and a concordance
measure: the Fuzzy Silhouette index [5] of the clustering structure obtained and the
Adjusted Rand index [10] between the same clustering structure and the original one.
Table 1 shows that there are no significant differences between EM-MML and
the other criteria, except ICL, which only recovers the very well separated structures.
Regarding the number of clusters, EM-MML and MAIC are tied, recovering this
number correctly for all data sets. The same is not true for the other criteria: AIC
identifies 3 clusters in 3 data sets and 4 clusters once; in addition, BIC and CAIC
could not find any cluster structure once, and ICL was unable to do so for 4 data sets.
In terms of computation time, since EM-MML does not require a sequential approach,
it is clearly faster than the other criteria (the Friedman test yields
$\chi^2(5) = 2500$ and p-value < 0.01; post hoc tests, with Bonferroni correction,
only reveal statistically significant differences between EM-MML and the other
criteria).
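
For readers unfamiliar with the concordance measure, the following Python sketch computes an adjusted Rand index between two hard partitions from their contingency counts. It uses the usual pair-counting form of the index as an assumption about the variant cited in [10]; the function name `adjusted_rand` and the label vectors are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same elements."""
    n = len(labels_a)
    # Pairs placed together in both partitions (from the contingency table)
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)   # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2.0
    if maximum == expected:
        return 1.0                              # degenerate case, e.g. single cluster
    return (together - expected) / (maximum - expected)

same = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, relabelled
crossed = adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1])   # partitions that disagree
```

The index equals 1 when the two partitions coincide up to relabelling and can be negative when the agreement is below what chance would give, which is why an upper limit of 1.000 in Table 1 corresponds to perfect recovery of the original structure.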


Table 1 Criteria performance.

Criterion  Number of   Fuzzy Silhouette: 95% CI   Adjusted Rand: 95% CI
           data sets   Lower ; Upper limits^      Lower ; Upper limits^
---------  ---------   ------------------------   ----------------------
AIC        10          0.430 ; 0.741              0.545 ; 0.867
BIC        9           0.622 ; 0.935              0.728 ; 1.000
CAIC       9           0.616 ; 0.931              0.732 ; 1.000
ICL        6           0.917 ; 0.948              1.000 ; 1.000
MAIC       10          0.568 ; 0.887              0.623 ; 0.950
EM-MML     10          0.561 ; 0.891              0.594 ; 0.955

^ 1000 bootstrap samples were used to estimate the Confidence Intervals (CI).


Additional insight into the performance of EM-MML is obtained by applying it
to a real data set from the 6th European Working Conditions Survey (2015),
carried out by Eurofound. Note that these data are the most recent.

For the purpose of our experiment, we consider the aggregate data referring to
305 European regions and the answers to the following questions: Are you able to
choose or change: a) your order of tasks; b) your methods of work; c) your speed or
rate of work? Do you work in a group or team that has common tasks and can plan
its work?

[Figure: Z-score profiles of the seven clusters on four items (order of tasks; methods
of work; speed or rate of work; work in a group or team).
Cluster 1 (n=44): fairly autonomous workers, not used at all to team work.
Cluster 2 (n=83): workers with slightly below average autonomy.
Cluster 3 (n=40): independent workers.
Cluster 4 (n=24): non-autonomous workers, but used to work in a team.
Cluster 5 (n=19): dependent workers.
Cluster 6 (n=9): fairly autonomous workers and used to team work but pressured to
finish tasks on time.
Cluster 7 (n=86): workers with slightly above average autonomy.]

Fig. 1 Clusters' profile and their dimensions (n).

EM-MML selected 7 clusters, fewer than the remaining criteria (ICL, BIC, CAIC,
AIC and MAIC select 10, 12, 12, 15 and 15, respectively). This avoids estimation
problems associated with very small segments and also improves the interpretability
of the clustering solution.

The segments selected by the EM-MML criterion are presented in Figure 1. Workers
with slightly above average autonomy (cluster 7) live in several countries, but Ireland
stands out, as well as Belgium, Germany, the Netherlands, Switzerland, and the UK
regions. Denmark, Estonia, Malta, and Norway are the countries where the most
independent workers are found (cluster 3). The smallest cluster, 6, includes Sweden
as well as Kriti and Açores, a Greek and a Portuguese region, respectively. Cluster 5,
where workers claim they have no autonomy, includes regions from many countries.


6 Discussion and Perspectives 


In this work, a model selection criterion and method for finite mixture models of
categorical observations, EM-MML, was studied. This algorithm simultaneously
performs model estimation and selects the number of components/clusters. When
compared to information criteria, which are commonly associated with the use of
the EM algorithm, the EM-MML method exhibits several advantages: 1) it easily
recovers the true number of clusters in synthetic data sets with various degrees of
separation; 2) its computation times are significantly lower than those required
by standard approaches resorting to the sequential use of EM and an information
criterion; 3) when applied to a real data set, it produces a more parsimonious solution,
which is thus easier to interpret. An additional advantage of this approach, which
stems from obtaining more parsimonious solutions, is that such solutions have a higher
number of observations per cluster, thus helping to overcome possible estimation
problems.

The performance of the EM-MML is encouraging for selecting the number of
clusters, and the same criterion was already used for feature selection [17]. However,
further research is required, namely considering data sets with different numbers of
clusters and high-dimensional data.


Typology of Motivation Factors for Employees in the Banking Sector: An Empirical
Study Using Multivariate Data Analysis Methods


Áurea Sousa, Osvaldo Silva, M. Graça Batista, Sara Cabral, and Helena 
Bacelar-Nicolau 


Abstract Leadership has been considered a competitive advantage for organizations,
contributing to their success and to effective and efficient performance. Motivation,
in turn, is assumed to be a basic competence of leadership. Therefore,
the main purpose of this paper is to understand the perceptions of bank employees
regarding the main motivational factors in the organizational context. Data analysis
was performed using several statistical methods, among which Categorical Principal
Component Analysis (CatPCA) and some agglomerative hierarchical clustering
algorithms from the VL (V for Validity, L for Linkage) parametric family, applied to
the items that aim to assess the aspects most valued by bankers in the work context.
CatPCA allowed the extraction of four principal components, which explain almost 70%
of the total data variance. The dendrograms provided by the hierarchical clustering
algorithms over the same data exhibit four main branches, which are associated with
different main motivational factors. Moreover, the CatPCA and clustering results show
an important correspondence concerning the main motivations in this sector.


1 Introduction


Motivation has always been a subject of analysis by the scientific community, and
numerous definitions have emerged. For Robbins and Judge ([21], p. 184), motivation
is defined as "the processes that account for an individual's intensity, direction, and
persistence of effort toward attaining a goal". These three indicators are assumed
to be key factors of motivation: intensity describes the individual's effort to achieve
the proposed goals; this effort should go in a direction that benefits the organization;
and, finally, persistence is the extent to which the individual is able to maintain that
effort. In this context, the individual's behavior is determined by what motivates them,
which is why their performance results not only from ability and skills, but also
from motivation. Moreover, motivation is complex and influenced by innumerable
variables, considering the diverse needs and expectations that individuals try to
satisfy in different ways [15]. In addition, different leadership practices may lead to
better or worse motivational responses from employees.

The main purpose of this paper is to analyse the perceptions of bank employees 
who work in the banks that operate in the Autonomous Region of the Azores on 
the main motivational factors in the organizational context. Our study also intends 
to perform a reduction of the dimensionality of the data and to find a typology of a 
set of items that was used to evaluate the latent variable “Motivation”, regarding the 
most valued aspects in the work context. Thus, Section 2 concerns the materials and 
methods of research. Section 3 presents and discusses the main results of this study. 
Finally, Section 4 contains the main conclusions. 


2 Materials and Methods 


This study was based on a quantitative approach, using a validated questionnaire, 
which can be found in Cabral [7]. The sample consists of 202 bank employees (51.0%
male and 49.0% female) of the Autonomous Region of the Azores (response
rate: 6.4%). Most respondents are 36 years old or older (60.9%) and have higher 
education (56.7%). 

The present study refers to a subset of twenty-seven items used to evaluate the 
latent variable “Motivation” in work context, namely: 1 - The opportunity for career 
advancement, 2 - Have greater responsibility, 3 - The feeling of being involved 
in decision making, 4 - A job that gives you prestige and status, 5 - Have an 
interesting and challenging job, 6 - The recognition and appreciation of others for 
the accomplished work, 7 - Have a good relationship with your colleagues, 8 - Have 
a good relationship with your superiors, 9 - A work environment where there is trust 
and respect, 10 - The loyalty of superiors towards the collaborators, 11 - Team spirit, 
12 - Sense of belonging to the organization, 13 - An adequate discipline, 14 - There 
is equality of treatment and opportunities between the various employees, 15 - Earn 
respect and esteem of your colleagues and superiors, 16 - Professional development, 
17 - Salary appropriate to the professional functions, 18 - A stable job that gives
you security, 19 - Good working conditions, 20 - Balance between personal and
professional life, 21 - Being able to express your opinion and ideas without fear of 
reprisals, 22 - Availability to solve problems/personal situations, 23 - Have a fair 
and adequate system of objectives and incentives, 24 - Being rewarded for overtime 
work, 25 - Being pressured to achieve the proposed objectives, 26 - Ability to handle 
pressure at work, and 27 - Appropriate training to the professional functions. 

For each item, respondents could pick only one of six response categories,
according to their level of agreement or disagreement with the items that assess
motivation: Totally disagree; Disagree most of the time; Slightly disagree; Slightly
agree; Agree most of the time, and Totally agree. In this study, Categorical Principal
Components Analysis (CatPCA), using the Varimax rotation method with Kaiser
Normalization, and some agglomerative hierarchical clustering algorithms (AHCA)
were used. Data analysis was performed using the packages IBM SPSS Statistics 26
and CLUSTI1 [19].

Principal Components Analysis (PCA) aims to reduce the dimensionality of the 
original data so that "the first few dimensions account for as much of the available 
information as possible" ([9], p. 83), assuming linear relationships among numeric 
variables. Each principal component is uncorrelated with all others, and it is ex- 
pressed as a linear combination of the original variables. CatPCA optimally quanti- 
fies categorical (ordinal or nominal) variables and can handle and discover nonlinear 
relationships between variables (e.g., [12]). In the present study, we applied the 
CatPCA due to the ordinal nature of the items under analysis. 

The goal of a clustering algorithm is to obtain a partition, where the elements 
within a cluster are similar and elements (objects/individuals/groups of individuals or 
variables) in different groups are dissimilar, identifying natural clustering structures 
in a data set (e.g., [8]). Agglomerative clustering algorithms usually start with each
element to be classified in its own separate cluster of size 1 (singleton). At each step,
the algorithms find the two "closest" clusters, taking into account the aggregation
criterion, and join them. The process continues until a cluster containing all elements
to classify is obtained. The AHCA of the set of items was based on the affinity
coefficient as a measure of comparison between elements, combined with two classic
(Single-Linkage (SL) and Complete-Linkage (CL)) and a family of probabilistic VL
(V for Validity, L for Linkage) aggregation criteria (e.g., [1, 2, 3, 10, 11, 16, 17, 18,
22]).
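
The agglomerative procedure described above can be sketched in a few lines. This is only an illustrative, naive implementation working directly on a similarity matrix; the function name `agglomerate` and its return format are assumptions made for this example and do not reproduce the software used in the study. With similarities, Single-Linkage compares clusters by their most similar pair and Complete-Linkage by their least similar pair.

```python
def agglomerate(sim, linkage="single"):
    """Naive agglomerative clustering on a symmetric similarity matrix.

    Returns the merges in order as (cluster_a, cluster_b, similarity) tuples.
    Hypothetical helper for illustration only.
    """
    clusters = [frozenset([i]) for i in range(len(sim))]  # start with singletons
    # SL uses the most similar pair between clusters, CL the least similar pair
    pick = max if linkage == "single" else min
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = pick(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), s))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The loop stops when a single cluster containing all elements remains, so a set of n elements always yields n - 1 merges.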

According to Ng et al. ([20], p. 849), "the task of finding good clusters has been
the focus of considerable research in machine learning and pattern recognition".
However, the identification of the best partitions using validation indices is also
of crucial importance. Therefore, a pertinent question arises: "How well does the
partition fit the data?" ([8], p. 505). As far as the validation of results is concerned, the
identification of the best partitions in the present study was based on the global
level statistics, STAT [1, 10, 11]. The global maximum STAT value indicates the best
cut-off level of a dendrogram and the local maxima of the STAT differences indicate the
most significant levels.

The affinity coefficient between two distribution functions was introduced by 
Matusita in 1951 (e.g., [13, 14]). Bacelar-Nicolau extended it to the non-supervised
classification field as a similarity measure between profiles. Let V be a set of p
variables, describing a set D of N statistical data units (individuals), so that each of
the N × p cells of the corresponding data table X contains one single non-negative
real value $x_{ik}$ ($i = 1, \ldots, N$; $k = 1, \ldots, p$), which denotes the value of the k-th variable
on the i-th individual. The standard affinity coefficient $a(k, k')$ between a pair of
variables, $V_k$ and $V_{k'}$ ($k, k' = 1, \ldots, p$), is given by formula (1), where
$x_{\cdot k} = \sum_{i=1}^{N} x_{ik}$ and $x_{\cdot k'} = \sum_{i=1}^{N} x_{ik'}$:

$$a(k, k') = \sum_{i=1}^{N} \sqrt{\frac{x_{ik}}{x_{\cdot k}} \cdot \frac{x_{ik'}}{x_{\cdot k'}}} \qquad (1)$$


The coefficient (1) is a symmetric similarity coefficient which takes values in 
[0,1] (1 for equal or proportional vectors and 0 for orthogonal vectors). Note that 
its mathematical formula corresponds to the inner product between the square root 
column profiles associated with those variables and measures a monotone tendency 
between column profiles. In the particular case of binary variables, the affinity 
coefficient coincides with the well-known Ochiai coefficient. Furthermore (e.g.,
[4, 6]), it is related to the Hellinger distance d by the relation $d^2 = 2(1 - a)$, which
has been used in the context of spherical factor analysis by Michel Volle. Later on,
the standard affinity coefficient was extended to the clustering of statistical data units 
or variables, mainly in a three-way approach (e.g., [3, 4, 5, 6]). The computation of 
the standard affinity coefficient between individuals can be performed by previously 
transposing the data matrix and then applying formula (1). 
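
As an illustration of formula (1), the sketch below computes the affinity coefficient between all pairs of columns as the inner product of their square-root column profiles. It assumes NumPy, a non-negative data table, and strictly positive column totals; the function name `affinity` is chosen only for this example.

```python
import numpy as np

def affinity(X):
    """Standard affinity coefficient between all pairs of columns of X.

    Assumes X is non-negative and every column total is strictly positive.
    """
    X = np.asarray(X, dtype=float)
    profiles = X / X.sum(axis=0)   # column profiles x_ik / x_.k
    roots = np.sqrt(profiles)      # square-root column profiles
    return roots.T @ roots         # a(k, k') = sum_i sqrt(p_ik * p_ik')

# Columns 0 and 1 are proportional, column 2 is orthogonal to both
X = np.array([[1, 2, 0],
              [3, 6, 0],
              [0, 0, 5]])
A = affinity(X)
```

Here `A[0, 1]` equals 1 (proportional columns) and `A[0, 2]` equals 0 (orthogonal columns). Applying `affinity(X.T)` gives the coefficient between individuals, in line with the transposition remark above.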

The probabilistic aggregation criteria in the scope of the VL methodology can be
interpreted as distribution functions of statistics of independent random variables
that are i.i.d. uniform on [0,1] (e.g., [3, 17]). The SL aggregation criterion can lead
to very long clusters (chaining effect). On the other hand, the AVL (Aggregation
Validity Link) has a tendency to form equicardinal clusters with an even number of
elements. The comparison functions between a pair of clusters, A and B, concerning
the family Γ of AVL methods can be generated by the following conjoined formula
(e.g., [17, 10, 11]):

$$\Gamma(A, B) = (p_{AB})^{g(\alpha, \beta)} \qquad (2)$$

with $\alpha = \mathrm{Card}\, A$, $\beta = \mathrm{Card}\, B$, $p_{AB} = \max[\gamma_{ab} : (a, b) \in A \times B]$, with
$1 \le g(\alpha, \beta) \le \alpha\beta$, and $\gamma_{xy}$ is a similarity measure between pairs of elements, x and
y, of the set of elements to classify (e.g., $g(\alpha, \beta) = 1$ for SL, $g(\alpha, \beta) = \alpha\beta$ for AVL).
Note that by varying $g(\alpha, \beta)$ with $1 \le g(\alpha, \beta) \le \alpha\beta$, a sort of compromise can be
built between SL and AVL methods (e.g., $g(\alpha, \beta) = (\alpha + \beta)/2$ for AV1). Thus, $\Gamma(A, B)$
will be "more polluted by the chain effect when $g(\alpha, \beta)$ remains near 1, and more
contaminated by the symmetry effect as long as $g(\alpha, \beta)$ is in the neighbourhood of
$\alpha\beta$" ([17], p. 95). Among the criteria that establish a compromise between AVL and
SL methods, the AV1 method stands out: its behavior is very similar to that of
AVL and it often provides, at its cut-off level, a partition better adjusted to the preorder
than the "best" classification obtained by AVL.
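
The family of comparison functions can be illustrated with a small sketch. It assumes that the pairwise similarities already lie in [0, 1] (for instance after a probabilistic transformation), which the formula needs for the exponent to act as described; the names `gamma` and `G` are hypothetical and used only here.

```python
def gamma(A, B, sim, g):
    """Comparison function Gamma(A, B) = p_AB ** g(alpha, beta).

    A, B : collections of element indices (two disjoint clusters)
    sim  : matrix of pairwise similarities, assumed to lie in [0, 1]
    g    : function of the two cluster cardinalities
    """
    alpha, beta = len(A), len(B)
    p_ab = max(sim[a][b] for a in A for b in B)  # largest between-cluster similarity
    return p_ab ** g(alpha, beta)

# Exponent choices named in the text
G = {
    "SL": lambda a, b: 1,              # g = 1
    "AV1": lambda a, b: (a + b) / 2,   # compromise between SL and AVL
    "AVL": lambda a, b: a * b,         # g = alpha * beta
}
```

For two clusters of size 2 whose largest between-cluster similarity is 0.8, the three criteria give 0.8, 0.8^2 and 0.8^4 respectively, which shows how a larger exponent penalizes merging larger clusters.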
3 Main Results and Discussion


Concerning the CatPCA, the best solution comprises four principal components, 
and the percentage of variance accounted for (PVAF) across these components is 
almost 70% (about 69%) of the data's total variance. All extracted components have 
eigenvalues above 1. Moreover, the first three main components have a very good
internal consistency and the fourth component has an acceptable internal consistency, 
as shown by the values of the Cronbach’s Alpha coefficient (see Table 1). 


Table 1 Rotated component loadings of the 4-component solution - Motivational factors.

Items                             PC1      PC2      PC3      PC4
M1                              0.213    0.351    0.699    0.166
M2                              0.197    0.044    0.794    0.211
M3                              0.248    0.148    0.763   -0.018
M4                             -0.028    0.098    0.482    0.442
M5                              0.354    0.219    0.674    0.037
M6                              0.522    0.214    0.425    0.095
M7                              0.837    0.110    0.193   -0.114
M8                              0.774    0.151    0.244    0.099
M9                              0.778    0.227    0.183   -0.125
M10                             0.783    0.269    0.227   -0.043
M11                             0.757    0.259    0.223   -0.103
M12                             0.798    0.155    0.227   -0.035
M13                             0.708    0.213    0.341    0.070
M14                             0.486    0.511    0.372   -0.257
M15                             0.775    0.263    0.252    0.041
M16                             0.432    0.364    0.665    0.035
M17                             0.289    0.708    0.410   -0.046
M18                             0.462    0.641    0.097   -0.247
M19                             0.548    0.532    0.211   -0.034
M20                             0.503    0.609    0.074   -0.223
M21                             0.684    0.401    0.070    0.074
M22                             0.678    0.399    0.019    0.054
M23                             0.295    0.770    0.284    0.102
M24                             0.174    0.835    0.176   -0.011
M25                             0.019   -0.012    0.233    0.864
M26                            -0.038   -0.146    0.035    0.896
M27                             0.543    0.458    0.230    0.227
Eigenvalue (VAF)                7.988    4.417    4.066    2.138
Percentage accounted (PVAF)     29.59    16.36    15.06     7.92
Cronbach's Alpha                0.950    0.934    0.919    0.610
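
The PVAF row of Table 1 can be checked from the eigenvalue row. The short sketch below assumes that the total variance equals the number of items (27), as is usual when each optimally scaled variable has unit variance; under that assumption the reported percentages and the overall figure of about 69% are reproduced.

```python
# Eigenvalues (VAF) of the four rotated components, taken from Table 1
eigenvalues = [7.988, 4.417, 4.066, 2.138]
n_items = 27  # assumption: total variance equals the number of items

# Percentage of variance accounted for by each component
pvaf = [100 * ev / n_items for ev in eigenvalues]
total_pvaf = sum(pvaf)

for name, value in zip(["PC1", "PC2", "PC3", "PC4"], pvaf):
    print(f"{name}: {value:.2f}%")
print(f"Total: {total_pvaf:.2f}%")
```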


The most important items for the first dimension are items M6, M7, M8, M9,
M10, M11, M12, M13, M15, M19, M21, M22, and M27, which are related to human
relationships/interactions with colleagues and hierarchical superiors, so it is called
"Psychological well-being/Interpersonal relationships". This dimension explains the
highest proportion of data variance (29.59%).

Concerning the second dimension, the items M14, M17, M18, M20, M23, and 
M24 are the most important, so this dimension was designated “Remuneration, 
job stability and incentive system". The most relevant items regarding the third 
dimension are M1, M2, M3, M4, M5, and M16; so, this dimension was called 
"Career progression/Professional achievement". Finally, the most important items 
for the fourth dimension are M25 and M26 related to “Fulfilment of the proposed 
objectives and the timings to achieve them". 

Regarding the AHCA of the same set of items, and considering the best cut-off 
levels, the results of the present study are summarized in Table 2. 


Table 2 The best partition - Standard affinity coefficient.

Method  The best partition                                        STAT     Cut-off level
SL/CL   {M1, M2, M3, M5, M8, M10, M11, M12, M13, M15, M14,        15.8858  20
        M16, M18, M19, M22, M20, M6, M23, M27, M24, M21};
        {M4}; {M9}; {M7}; {M25}; {M26}; {M17}
AV1     {M1, M2, M3, M6, M27, M21, M5, M23, M24, M8, M15,         15.6490  22
        M14, M16, M10, M13, M11, M12, M18, M19, M20, M22};
        {M4, M25, M26}; {M7}; {M9}; {M17}


According to the STAT values, the best partitions were obtained by the classic
SL/CL and the probabilistic AV1 methods (see Table 2). All dendrograms highlighted
four main branches, which are associated with different motivational factors ("Career 
progression"; "Psychological well-being / Interpersonal relationships"; "Organiza- 
tional environment and working conditions"; "Conformity with objectives and time 
to reach them"), bringing new information, and identifying some singletons, as 
shown in Figure 1. 


4 Conclusion 


Organizations and their leaders have become increasingly aware of the importance
of employee well-being and of the fact that negative feelings can harm productivity.
Thus, it is essential to ensure the well-being of employees, taking into
account the main motivational factors identified in this study. CatPCA made it
possible to extract four principal components (dimensions), which explain almost 70%
of the total variance of the data and were designated, respectively, "Psychological
well-being/Interpersonal relationships"; "Remuneration, job stability and
incentive system"; "Career progression/Professional achievement"; and "Fulfilment
of objectives and timings to achieve them". Regarding the AHCA of the items that
assess motivation, the dendrograms highlight four main branches, which are associated
with different motivational factors called "Career progression"; "Psychological
well-being / Interpersonal relationships"; "Organizational environment and working
conditions"; and "Conformity with objectives and time to reach them". They carry
new information and identify some singletons as well. Comparing the dendrograms,
we conclude that the clusters referring to the best partitions are quite similar, with
observed differences mainly concerning the few singletons. Moreover, the effective
and fruitful correspondence between the AHCA and the CatPCA results may
help to better understand the main types of factors identified. In fact, the four main
branches of all dendrograms are related to motivational factors whose
interpretations are in consonance with those identified through CatPCA.

Fig. 1 Dendrogram - Standard affinity coefficient + AV1.
A Proposal for Formalization and Definition of
Anomalies in Dynamical Systems 


Jan Michael Spoor, Jens Weber, and Jivka Ovtcharova 


Abstract Although many scientists strongly focus on anomaly detection in different 
applications and domains, there currently exists no universally accepted definition 
of anomalies and outliers. Using an approach based on control theory and dynamical 
systems, as well as a definition for anomalies as described by philosophy of science, 
the authors propose a generalized framework viewing anomalies as key drivers 
of progress for a better understanding of the dynamical systems around us. By 
mathematically defining anomalies and delimiting deviations within expectations 
from completely unforeseen instances, this paper aims to be a contribution to set up 
a universally accepted definition of anomalies and outliers. 


Keywords: anomaly detection, outlier analysis, dynamical systems 


1 Introduction 


Anomalies, often interchangeably called outliers [1], are of key interest in explorative 
data analysis. Therefore, anomaly detection finds application in many different sci- 
entific fields, i.e., in social science, economics, engineering, and medical science [2]. 
In particular, research in these domains regarding databases, data mining, machine 
learning or statistics focuses strongly on anomaly detection [3]. Despite the wide
range of anomaly detection, there is currently no universally accepted definition of
what an outlier or anomaly is [2], and the mathematical definition depends on the 
selected method to find these anomalies [4]. 

The authors previously proposed an applied framework to formalize anomalies 
within the context of control theory and dynamical systems [5]. In this publication, 
the idea is discussed in more depth, and a generalization of the framework is proposed 
to extend its application area to more domains since dynamical systems are relevant 
in engineering and science [6] as well as in management science and economics [7]. 
Furthermore, the proposed definition of anomalies should also be applicable outside 
of the context of control theory and aims to be a contribution to set up a universally 
accepted definition of anomalies and outliers. 

When controlling or simulating dynamical systems, a measurement and prediction 
process is used. Anomalies occur in this process as substantial deviations of a 
measured system state (an actual value) from an expected system state (a planned 
value) [5]. Despite simulation and planning effort, these deviations still occur. While 
some deviations fall within an acceptable range and within the expectations of normal 
system behavior, other anomalies are completely unforeseen and do not fit the set-up 
and expectations of the system. Three sequential questions are derived to further 
investigate the nature of anomalies within dynamical systems: 


1. What distinguishes unforeseen system states from regular system behavior? 

2. How can unforeseen system states or errors occur despite simulation? 

3. How can unforeseen system states be analyzed and transferred to a standard 
model of a system’s behavior? 


2 Definition of Anomalies for Dynamical Systems 
2.1 Definitions of Anomalies and Outliers 


In general, it is assumed that anomalies are somehow visible within the data of 
the observed systems. This is also clearly stated by the definition of an outlier or 
anomaly as data points with a substantial deviation from the norm since this requires 
a normal state of the system and a measurable deviation [8]. Furthermore, the 
anomaly detection requires existence and knowledge of a normal state, a definition 
of a deviation, a metric, and a threshold measure of distance. This threshold measure 
of distance uses the selected metric. All distances between the norm and the data 
points, which are either above (in case of distance measures) or below (in case of
similarity measures) the defined threshold, are assumed to be substantial.
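
The ingredients listed above (a normal state, a metric, and a threshold) can be combined in a minimal sketch. It uses a distance measure and flags points lying farther from the normal state than the threshold; the function name `flag_anomalies`, the Euclidean default metric, and the numeric values are assumptions made for illustration only.

```python
import math

def flag_anomalies(points, norm, threshold, metric=math.dist):
    """Flag points whose distance to the normal state exceeds the threshold.

    points    : iterable of coordinate tuples
    norm      : coordinate tuple describing the normal state
    threshold : distance above which a point is flagged
    metric    : distance function (Euclidean distance by default)
    """
    return [metric(p, norm) > threshold for p in points]

flags = flag_anomalies([(0, 0), (1, 1), (5, 5)], norm=(0, 0), threshold=2.0)
```

Swapping the `metric` argument changes which points are flagged without touching the rest of the procedure, which is the dependency on the metric discussed in the next paragraph.
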
Therefore, in addition, the selection of an appropriate metric becomes an impor- 
tant tool to accurately describe an anomaly. Some authors claim that, in a practical 
application, the selection of a suitable metric might be more important than the 
algorithm itself. For example, if clusters are clearly separated within the examined 
dataset in context of the selected metric, clusters will be found independently of
the used method or algorithm [9]. Other authors claim that the selected method for
investigating clusters is of importance [10]. 

To summarize, there is no trivial definition of a normal state, a deviation, and when 
a deviation might be substantial. Some authors therefore describe the usefulness of 
an analysis only within the context of the goals of the analysis [11]. Outlier detection 
becomes more of a technical target than an actual scientific finding of something 
novel since the novelty is always defined within the technical target of the analysis. 
Alternatively, the normal model of the data defines an anomaly [1]. 

This results, for example, in approaches of regression diagnostics to exclude 
outliers and anomalous data prior to an analysis or to conduct the analysis along the 
standard model in a more robust way, which is less affected by anomalies [12]. Both
approaches maintain the normal model by treating anomalies as if they
were less adequate or not at all representative of the data set.

Since anomalies are only relevant within a context, a typology of anomalies within 
different dataset contexts can be created. Thus, Foorthuis [13] proposes a typology 
along the following dimensions: types of data (qualitative, quantitative or mixed), 
anomaly level (atomic or aggregated) and cardinality of relationship (univariate or 
multivariate). Anomalies are, within this kind of typology, always dependent on 
the dataset and behave differently along the measured features, which have been 
classified as relevant for the specific analysis. The anomaly detection becomes a 
detection of unfitting, surprising values while maintaining the normal model. 


2.2 Definition by Philosophy of Science 


If the assumptions regarding normal states, deviation, and substantiality are dropped, 
it is possible to discuss anomalies on a more fundamental level for understanding 
our surroundings and the observations of them. 

To do this, anomalies have to be placed in the historic context of science and 
research. Since anomaly detection as a discipline of data science is placed within 
the scientific context [14], anomaly detection can also be analyzed as part of the 
scientific method and therefore a comparison with the historical understanding of 
anomalies in the context of science becomes relevant. By definition of Kuhn [15], 
anomalies play an important role in the scientific discovery of novelties: 


Discovery commences with the awareness of anomaly, i.e., with the recognition that nature 
has somehow violated the paradigm-induced expectations that govern normal science. It 
then continues with a (...) exploration of the area of anomaly. And it closes only when the 
paradigm theory has been adjusted so that the anomalous has become the expected. 


This statement describes scientific progress as a stepwise discovery and the place- 
ment of anomalies within a normal state by science. The discussed normal state is 
therefore dictated by current scientific knowledge, which encompasses the predic- 
tions of the currently available and widely used models and theories. An anomaly 
violates the normal state by violating the predictions of these models. The steps of 
scientific progress are then as follows: 
1. Knowledge of the anomaly.

2. Stepwise acknowledgement of observations and conceptual nature of the 
anomaly. 

3. Change of paradigm and methods to include the anomaly in the new models, 
often under resistance by the scientific community itself. 


Therefore, different states of an anomaly exist as follows: 


1. The anomaly is completely unknown. 
2. The anomaly is neither described nor modeled but was observed. 
3. The anomaly is now commonly recognized and placed within the standard model.


The states of anomalies correspond to the initially defined questions in the in- 
troduction regarding the delimitation of anomalous states from normal states, the 
exploration of the causes for anomalies, and the modeling and planning with the 
now known anomalies. If the states of anomalies are used to describe practical errors 
in engineering, error states of systems are not anomalies. This is the case because 
if error states are priorly classified as such, they are therefore already known and 
described. This corresponds to the idea that outliers or anomalies are created by a 
different underlying mechanism [16] and therefore imply an unknown system behav- 
ior, which needs modeling to better describe the system. In addition, this follows the 
assumption of a normal state in which anomalies simply deviate from a normal model
[1] since they are not part of the normal model. Also, this idea relates strongly to the 
discussion of the relation between novelty and anomaly detection [17]. 

To follow the definitions by Kuhn [15], science is driven by internal progress, lim- 
ited by the current methods and available resources, while external targets, defined by 
stakeholders, e.g., society or companies, drive technicians. This description matches 
the idea that the usefulness of an analysis should be evaluated within the context of 
its goals [11] and distinguishes two types of anomalies: "Scientific" anomalies of a 
novel observation and "technical" anomalies as deviations from a predefined norm 
using a predefined measurement of substantiality. 

"Scientific" anomalies might still result in unwanted system states, which then 
can result in some kind of error or critical system state. Nevertheless, not every 
"scientific" anomaly inevitably results in an error state and not every error state is 
a "scientific" anomaly. An anomaly is not a "scientific" anomaly if the error state 
is already documented or can be described by the standard model. In this case, the 
anomaly becomes a "technical" anomaly. 

Using the philosophy of science definition of anomalies, the normal state is the 
prediction by the system model, the deviation is the difference between the prediction 
of the system state and the measured actual state of the system, and the substantiality 
is defined by the noise and precision of our predictions and measurement tools. 
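
Read this way, a deviation check reduces to comparing a prediction residual with the known noise level. The sketch below is one possible reading, assuming a scalar state and a tolerance of `k` noise standard deviations; both the function name and the default `k = 3` are illustrative assumptions, not part of the proposed framework.

```python
def substantial_deviation(predicted, measured, noise_sigma, k=3.0):
    """Return True if the prediction residual exceeds k times the noise level.

    predicted   : state predicted by the system model (the normal state)
    measured    : actually measured state
    noise_sigma : assumed standard deviation of measurement and prediction noise
    k           : tolerance multiplier (assumed value for illustration)
    """
    return abs(measured - predicted) > k * noise_sigma
```

With a predicted value of 10.0 and a noise level of 0.1, a measurement of 10.2 stays within tolerance, whereas a measurement of 11.0 counts as a substantial deviation.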


3 Proposed Framework for a Formalization of Anomalies 


To separate "scientific" and "technical" anomalies, a formerly proposed framework
[5] is generalized as illustrated in Fig. 1 and mathematically defined in this section.

Fig. 1 Formalization of "scientific" and "technical" anomalies and system states.


Definition 1 (System State) There exists a multivariate description $x_i$ of a state i
with a finite number of features. For each feature j of state i a value $x_{ij}$ exists, which
is a realization of the feature space $R_j$. The value $x_{ij}$ is the actual and precise state
description of feature j at state i. Although there exists only a single true value
$x_{ij}$, the value itself does not necessarily have to be a single data point but can be a
multivariate or symbolic data value and can be of any data type.

$$\forall i \; \forall j \; \exists! \, x_{ij} : x_{ij} \in R_j \qquad (1)$$

The set C of all combinations of system state values with J features is given by:

$$C = \{x \mid \forall i, j \; \exists \, x_{ij} \in R_j\} = R_1 \times \ldots \times R_J \qquad (2)$$


Definition 2 (Operation) An operation is an analytical function f which changes 
the system state from state i to the following state i+ 1. Both states belong to the set 
of all combinations of system states C. 


$$f : C \to C, \quad f(x_i) = x_{i+1} \qquad (3)$$


There exists a finite set F of functions of endogenous state transformations. This 
set of functions is the scope of operations that can be performed. These functions 
are the fundamental functionality of a system, which can be performed without any 
external involvement. For all functions the following expression is applied: 


$$g \in F \land f \in F : g \circ f \in F \qquad (4)$$


Using the defined function space, a restriction of reachable system states via all 
functions from F is defined, resulting in the set of physically possible system states. 


Definition 3 (Physically Possible System States) The relation f spans the complete
space of state changes of a system using the entire scope of operations. The resulting
space is the set of all possible system states. The physically possible system states
are the possible realizations of $x_i$ based on a starting point and if only functions from
F are applied. The set P is a group with a neutral element of operations.

$$P = \{x \mid \forall f \in F : f(x) \in P\} \subset C \qquad (5)$$
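
For a finite toy system, the set P can be computed as the closure of a starting state under the operations in F. The sketch below assumes hashable states and a finite state space (otherwise the loop need not terminate); the function name `reachable_states` and the modular example are hypothetical.

```python
def reachable_states(start, operations):
    """Closure of `start` under a finite set of operations (state -> state).

    Assumes a finite state space and hashable states.
    """
    seen = {start}
    frontier = [start]
    while frontier:
        x = frontier.pop()
        for f in operations:
            y = f(x)
            if y not in seen:       # newly reached state
                seen.add(y)
                frontier.append(y)
    return seen

# Toy system: C = {0, ..., 5}; F contains "advance by 2" and the identity
C = set(range(6))
F = [lambda x: (x + 2) % 6, lambda x: x]
P = reachable_states(0, F)
```

In this toy example only the even states are reachable from 0, so the remaining states of C are never produced by the system itself.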


Definition 4 (Observed System States) Of the J existing features of the
system state, only D features are known, with $D \le J$. Since not all
system states can be measured, a function z transforms the real system states and
real operations of the system into observable system states and operations.

$$z : C \to M, \quad z(x_i) = x_i^* \qquad (6)$$

Therefore, the set $M = R_1 \times \ldots \times R_D$ is the space of all observable and known
system states. Function z is the measurement process.


Definition 5 (Observed Operations) Not all functions of the whole set of functions
F are known or observable when planning and operating a system.

$$F' \subseteq F \qquad (7)$$

Additionally, only observable system states are modeled when operating a system.
The observed operations of systems are therefore projections of a subset of known
operations of F and operate within the observed and known system states.

$$F^* = z(F') \qquad (8)$$

The actually conducted operations f are always from the set of operations F, but the
expectation and prediction utilize, due to lack of system knowledge, only $f^* \in F^*$.

$$f^* : M \to M, \quad f^*(x_i^*) = x_{i+1}^* \qquad (9)$$

Therefore, all states applied in operation $f^*$ are defined as expected system states.


Definition 6 (Expected System States) The system states that are possible if
only the observed and known operations of the set $F^*$ are applied to all system states
$x_i^* \in E$ are the expected system behavior.

$$E = \{x_i^* \mid \forall f^* \in F^* : f^*(x_i^*) \in E\} \subset M \qquad (10)$$


The expected system states can be further split into desired system states, where
the system is running most beneficially for its usage, critical system states, where
a possible error or rare system states are measured, and error states, which are
system faults with operational risks involved as defined by Basel III [18]. Applied
in engineering, this definition is compatible with the definition of DIN EN 13306 
since the system is at risk of being unable to perform a certain range of functions 
without necessarily being completely inoperable [19]. All kinds of errors, warnings 
and non-beneficial system states are the "technical" anomalies within the contextual 
analysis of the data set. 




Definition 7 (Unforeseen System States) The set of unforeseen system states U
therefore comprises all measurable system states within the realm of observable system states
but not within the expected system states:


$$U = M \setminus E \qquad (11)$$


"Scientific" anomalies in unforeseen system states are measured if the real oper- 
ation f differs from f* such that a prediction error occurs: 


$$f^*(x_i^*) \in E, \quad f^*(x_i^*) \neq z(f(x_i)) \notin E \qquad (12)$$


"Scientific" anomalies are part of the unforeseen system states. Another reason for 
unforeseen system states is a measurement of an impossible system state. Anomalies 
originated by physically impossible system states are to be distinguished from "scien- 
tific" anomalies since the reason for their occurrence follows a different mechanism. 
Thus, they are assigned to the "technical" anomalies. 
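
The mismatch between a modeled and a real operation can be illustrated with a toy system in which one feature is hidden from the measurement. Everything in this sketch (the two-feature state, the functions `z`, `f`, `f_star`, and `prediction_error`) is a hypothetical example; it only checks the inequality between prediction and measurement and does not test membership in E.

```python
def z(state):
    """Measurement process: only the first feature is observable."""
    return state[0]

def f(state):
    """Real operation: the hidden feature influences the observable one."""
    position, hidden = state
    return (position + 1 + hidden, hidden)

def f_star(observed):
    """Modeled operation: assumes the position always increases by 1."""
    return observed + 1

def prediction_error(state):
    """True if the model prediction differs from the measured next state."""
    return f_star(z(state)) != z(f(state))
```

As long as the hidden feature is zero, model and measurement agree; once it becomes non-zero, the measured state departs from the prediction although no observable feature explains the difference.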


Definition 8 (Physically Impossible System States) Physically impossible system
states I are combinations of states in set C which are not reachable using function f:


$$I = C \setminus P \qquad (13)$$


Definition 9 (External Influence) Applying changes to the system, the feature 
space also changes. Consequently, the space of the physically possible system states 
changes. Previously impossible system states become possible system states. 


Definition 10 (Faulty Data Points) If a measurement is conducted incorrectly, the 
measured values could be within the impossible system states. Faulty data points are 
therefore neither measurement noise nor imprecision, but should be systematically 
excluded. Note that faulty data points could be within the possible system space but 
need to be excluded either way. 


4 Conclusion 


It is concluded that the anomaly concept is often loosely defined and heavily depends
on assumptions of a normal state, deviation, and substantiality. These definitions are 
often case-specific and influenced by the conducting researchers' choice. Therefore, 
a rigorous definition of anomalies is capable of further streamlining the discourse 
and increasing a common understanding of what kind of anomaly is described. 
Using "technical" and "scientific" anomalies, further research will be conducted 
to set up models detecting both types of anomalies separately. Differences between 
observed and real system states and operations are a focus of further research to
more precisely analyze the hidden processes of the "scientific" anomaly generation. 
Also, a more fundamental discussion of the philosophical definition of anomalies 
within the philosophy of science and its applications to anomaly detection in general 
should be conducted to further gain insight into the true nature of anomalies. 
The authors plan to validate the concept by using the proposed definition and
framework in exemplary applications within industrial processes. Furthermore, 
anomaly detection methods designed for applications in dynamical systems using 
the proposed framework are planned to be developed. 
New Metrics for Classifying Phylogenetic Trees
Using K-means and the Symmetric Difference 
Metric 


Nadia Tahiri and Aleksandr Koshkarov 


Abstract The k-means method can be adapted to any type of metric space and is
sometimes linked to the median procedures. This is the case for the symmetric difference
metric (or Robinson and Foulds distance) in phylogeny, where it can lead to median
trees as well as to Euclidean embedding. We show how a specific version of the
popular k-means clustering algorithm, based on interesting properties of the Robinson
and Foulds topological distance, can be used to partition a given set of trees into
one (when the data is homogeneous) or several (when the data is heterogeneous)
cluster(s) of trees. We have adapted the popular Silhouette and Gap cluster validity
indices to tree clustering with k-means. In this article, we will show results of this
new approach on a real dataset (aminoacyl-tRNA synthetases). The new version of 
phylogenetic tree clustering makes the new method well suited for the analysis of 
large genomic datasets. 


Keywords: clustering, symmetric difference metrics, k-means, phylogenetic trees, 
cluster validity indices 


1 Introduction 


In biology, one of the most significant organizing principles is the "Tree of Life" 
(ToL) [12]. In genetic studies, there is evidence of an enormous number of branches, 
but even a rough estimate of the total size of the tree remains difficult. Many recent
representations of ToL have emphasized either the existence of deep evolutionary
relationships [7] or the knowledge of a large and diverse variety of life, with an 
emphasis on Eukaryotes [8]. These approaches do not consider the dramatic evolution 
in our understanding of the diversity of life due to genomic sampling of previously 
unexplored environments. 

As a result, Maddison in 1991 [11] was the first to formulate the idea of multiple 
consensus trees when he described his phylogenetic island method. He observed 
that island consensus trees can differ significantly from each other and are generally 
better resolved than the species-wide consensus tree. The most intuitive approach to 
discovering and clustering genes that share similar evolutionary histories is to cluster 
their genetic phylogenies. In this context, Stockham et al. in 2002 [18] proposed 
a tree clustering algorithm based on k-means [4, 9, 10] and the Robinson and 
Foulds quadratic distance [15]. Their clustering algorithm aims to infer a set of 
strict consensus trees, minimizing information loss. They proceed by determining 
the consensus trees for each set of clusters in all intermediate partitioning solutions 
tested by k-means. This makes the Stockham et al. algorithm very expensive in 
terms of execution time. More recently, Tahiri et al. in 2018 [19] proposed a fast and 
accurate tree clustering method based on k-medoids. Finally, Silva and Wilkinson 
in 2021 [17] introduced a revised definition of tree islands based on any tree-to-tree 
metric that usefully extends this notion to any set or multiset of trees and provided 
an interesting discussion of biological applications of their method. 

In this context, the use of a method that infers multiple supertrees (i.e., a supertree 
clustering method) would help discover and cluster alternative evolutionary scenarios 
for several ToL subtrees. 

The paper is structured as follows. In the next section, we introduce a new metric
for the k-means algorithm based on the Robinson and Foulds distance. Section 3
presents the results (on a real dataset) obtained with our algorithm compared to
other clustering methods. Finally, we discuss our contributions in Section 4.


2 Methods 


The k-means algorithm [9, 10] is a very common algorithm for partitioning data. From
a set of N observations $x_1, \ldots, x_N$, each one being described by M variables, this
algorithm creates a partition into k homogeneous classes or clusters. Each observation
corresponds to a point in an M-dimensional space and the proximity between two
points is measured by the distance between them. In the framework of k-means, the
most commonly used distances are the Euclidean distance, Manhattan distance, and
Minkowski distance [4]. To be precise, the objective of the algorithm is to find the
partition of the N points into k clusters in such a way that the sum of the squares of the
distances of the points to the center of gravity of the group to which they are assigned
is minimal. Finding an optimal partition according to the k-means least-squares
criterion is known to be NP-hard [13]. Considering this fact, several polynomial-time
heuristics were developed, most of which have a time complexity of O(KNIM) for
finding an approximate partitioning solution, where K is the maximum possible number
of clusters, N is the number of objects (for example, phylogenetic trees), I is the
number of iterations in the k-means algorithm, and M is the number of variables
characterizing each of the N objects.
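
As an illustration of the least-squares criterion and of one such polynomial-time heuristic, the following is a minimal sketch of Lloyd's iteration on Euclidean points. It is not the PhyloClust implementation; the function names, the toy points and the initial centers are assumptions made for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd's heuristic: alternate an assignment step and a centroid update."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels
        # Update step: each center becomes the centroid of its cluster.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def within_ss(points, labels, centers):
    """Least-squares criterion: sum of squared distances to assigned centers."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
criterion = within_ss(points, labels, centers)
```

Each iteration costs on the order of K × N × M operations, which is where the O(KNIM) bound for I iterations comes from.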

A well-known metric for comparing two tree topologies in computational biology
is the Robinson-Foulds distance (RF), also known as the symmetric-difference
distance [15]. The RF distance is a topological distance, which means that
it does not consider the lengths of the edges of the trees. The RF distance
can be written as $RF(T_1, T_2) = n_1(T_1) + n_2(T_2)$, where $n_1(T_1)$ is the number of
partitions of data implied by the tree $T_1$ but not by the tree $T_2$, and $n_2(T_2)$ is the
number of partitions of data implied by the tree $T_2$ but not by the tree $T_1$. According
to Barthélemy and Monjardet [1], the majority-rule consensus tree of a set of trees is
the median tree of this set. This fact makes the use of tree clustering possible.
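
A minimal sketch of the RF computation follows, assuming each tree is already encoded as the set of non-trivial bipartitions induced by its internal edges. The encoding, the helper names and the two five-leaf trees are illustrative assumptions, not part of the paper.

```python
def bipartition(side, leaves):
    """Canonical form of the split side | rest induced by one internal edge."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})

def rf_distance(splits1, splits2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits1 - splits2) + len(splits2 - splits1)

leaves = {"a", "b", "c", "d", "e"}
# Hypothetical trees T1 = ((a,b),c,(d,e)) and T2 = ((a,c),b,(d,e)).
t1 = {bipartition({"a", "b"}, leaves), bipartition({"d", "e"}, leaves)}
t2 = {bipartition({"a", "c"}, leaves), bipartition({"d", "e"}, leaves)}

rf = rf_distance(t1, t2)
# Normalisation by the maximum value 2n - 6 for two binary trees with n leaves.
rf_norm = rf / (2 * len(leaves) - 6)
```

Here the two trees share the split {d, e} and differ on one split each, so the distance counts two bipartitions out of a possible four.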


2.1 Silhouette Index Adapted for Tree Clustering 


The first popular cluster validity index we consider in our study is the Silhouette 
width (SH) [16]. Traditionally, the Silhouette width of the cluster k is defined as 
follows: 


\[
s(k) = \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{b(i) - a(i)}{\max(a(i), b(i))} , \tag{1}
\]


where $N_k$ is the number of objects belonging to cluster k, a(i) is the average distance
between object i and all other objects belonging to cluster k, and b(i) is the smallest,
over all clusters k' different from cluster k, of all average distances between i and all
the objects of cluster k'.

We used Equations (2) and (4) for calculating a(i) and b(i), respectively, in
our tree clustering algorithm (see also [19]). For instance, the quantity a(i) can be
calculated as follows:

\[
a(i) = \frac{1}{N_k} \sum_{j=1}^{N_k} \left( \frac{RF(T_{ki}, T_{kj})}{2n(T_{ki}, T_{kj}) - 6} + \xi \right) , \tag{2}
\]

where $N_k$ is the number of trees in cluster k, $T_{ki}$ and $T_{kj}$ are, respectively, trees i
and j in cluster k, $n(T_{ki})$ is the number of leaves in tree $T_{ki}$, $n(T_{kj})$ is the number of
leaves in tree $T_{kj}$, and $\xi$ is a penalty function which is defined as follows:

\[
\xi = \alpha \times \frac{\mathrm{Min}(n(T_{ki}), n(T_{kj})) - n(T_{ki}, T_{kj})}{\mathrm{Min}(n(T_{ki}), n(T_{kj}))} , \tag{3}
\]

386 N. Tahiri and A. Koshkarov 


where $\alpha$ is the penalization (tuning) parameter, taking values between 0 and 1, used
to prevent putting into the same cluster trees having small percentages of leaves
in common, and $n(T_{ki}, T_{kj})$ is the number of common leaves in trees $T_{ki}$ and $T_{kj}$.
The formula for b(i) is as follows:

\[
b(i) = \min_{1 \le k' \le K,\; k' \ne k} \frac{1}{N_{k'}} \sum_{j=1}^{N_{k'}} \left( \frac{RF(T_{ki}, T_{k'j})}{2n(T_{ki}, T_{k'j}) - 6} + \xi \right) , \tag{4}
\]

where $T_{k'j}$ is the tree j of the cluster k', such that k' ≠ k, and $N_{k'}$ is the number of
trees in the cluster k'.
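
The pairwise term used in a(i) and b(i) can be sketched as follows, assuming the penalty is ξ = α × (Min(n(T_i), n(T_j)) − n(T_i, T_j)) / Min(n(T_i), n(T_j)) and that both trees are first reduced to their common leaves. The bipartition encoding, the helper names and the example trees are assumptions made for illustration.

```python
def split(side, leaves):
    """Canonical bipartition side | rest of a tree defined on `leaves`."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})

def restrict(splits, common):
    """Reduce a tree's bipartitions to the leaves in `common`."""
    reduced = set()
    for s in splits:
        sides = [side & common for side in s]
        # Keep only non-trivial splits: both sides need at least two leaves.
        if all(len(side) >= 2 for side in sides):
            reduced.add(frozenset(sides))
    return reduced

def penalized_distance(leaves1, splits1, leaves2, splits2, alpha=1.0):
    """Normalised RF on the common leaves plus the penalty xi."""
    common = frozenset(leaves1) & frozenset(leaves2)
    n_common = len(common)
    if n_common < 4:
        raise ValueError("at least 4 common leaves are needed for 2n - 6 > 0")
    r1, r2 = restrict(splits1, common), restrict(splits2, common)
    rf = len(r1 ^ r2)  # symmetric difference of the two split sets
    smallest = min(len(leaves1), len(leaves2))
    xi = alpha * (smallest - n_common) / smallest
    return rf / (2 * n_common - 6) + xi

# Hypothetical trees sharing four of their five leaves.
leaves1 = {"a", "b", "c", "d", "e"}
leaves2 = {"a", "b", "c", "d", "f"}
t1 = {split({"a", "b"}, leaves1), split({"d", "e"}, leaves1)}
t2 = {split({"a", "b"}, leaves2), split({"d", "f"}, leaves2)}
d12 = penalized_distance(leaves1, t1, leaves2, t2, alpha=1.0)
```

In this example the reduced trees are identical, so the whole dissimilarity comes from the penalty for the one leaf that is not shared.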

The optimal number of clusters, K, corresponds to the maximum average Silhou- 
ette width, SH, which is calculated as follows: 


\[
SH = \bar{s}(K) = \left[ \sum_{k=1}^{K} s(k) \right] / K . \tag{5}
\]


The value of the Silhouette index defined by Equation (5) ranges from -1 to +1. 
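
A minimal sketch of the Silhouette computation from a precomputed dissimilarity matrix is given below. It follows the traditional definition, in which a(i) averages over the other members of the cluster; the function name, the zero score for singleton clusters and the toy matrix are assumptions, not part of the paper.

```python
def silhouette(dist, labels):
    """Return the per-cluster widths s(k) and their average SH."""
    clusters = sorted(set(labels))
    members = {k: [i for i, lab in enumerate(labels) if lab == k] for k in clusters}
    widths = {}
    for k in clusters:
        total = 0.0
        for i in members[k]:
            others = [j for j in members[k] if j != i]
            if not others:
                continue  # singleton: contributes 0 (a convention assumed here)
            a = sum(dist[i][j] for j in others) / len(others)
            # Smallest average distance to another cluster; needs K >= 2.
            b = min(
                sum(dist[i][j] for j in members[q]) / len(members[q])
                for q in clusters if q != k
            )
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
        widths[k] = total / len(members[k])
    return widths, sum(widths.values()) / len(widths)

# Hypothetical dissimilarities between four objects forming two tight pairs.
dist = [
    [0, 1, 10, 11],
    [1, 0, 9, 10],
    [10, 9, 0, 1],
    [11, 10, 1, 0],
]
widths, sh = silhouette(dist, [0, 0, 1, 1])
_, sh_bad = silhouette(dist, [0, 1, 0, 1])
```

With a single cluster the minimum over other clusters is taken over an empty set and the function raises an error, which mirrors the fact that the index is undefined for K = 1.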


2.2 Gap Statistic Adapted for Tree Clustering 


It is worth noting that the SH cluster validity index (Equations (1) to (5)) does not allow
comparing the solution consisting of a single consensus tree (K = 1; the calculation of
SH is impossible in this case) with clustering solutions involving multiple consensus
trees or supertrees (K ≥ 2). This can be considered an important disadvantage
of SH-based classifications because a good tree clustering method should be
able to recover a single consensus tree or supertree when the input set of trees is
homogeneous (e.g. for a set of gene trees that share the same evolutionary history).

The Gap statistic was first used by Tibshirani et al. [20] to estimate the number of
clusters provided by partitioning algorithms. The formulas proposed by Tibshirani
et al. were based on the properties of the Euclidean distance. In the context of tree
clustering, the Gap statistic can be defined as follows. Consider a clustering of N
trees into K non-empty clusters, where K ≥ 1. First, we define the total intracluster
distance, $D_k$, characterizing the cohesion between the trees belonging to the same
cluster k:

\[
D_k = \sum_{i=1}^{N_k} \sum_{j=1}^{N_k} \left( \frac{RF(T_{ki}, T_{kj})}{2n(T_{ki}, T_{kj}) - 6} + \xi \right) . \tag{6}
\]

Then, the sum of the average total intracluster distances, $V_K$, can be calculated
using this formula:

\[
V_K = \sum_{k=1}^{K} \frac{1}{2N_k} D_k . \tag{7}
\]

Finally, the Gap statistic, which reflects the quality of a given clustering solution
including K clusters, can be defined as follows:

\[
Gap_N(K) = E^*_N\{\log(V_K)\} - \log(V_K) , \tag{8}
\]

where $E^*_N$ denotes expectation under a sample of size N from the reference
distribution. The following formula [20] for the expectation of $\log(V_K)$ was used in our
algorithm:

\[
E^*_N\{\log(V_K)\} = \log(Nn/12) - (2/n) \log(K) , \tag{9}
\]

where n is the number of tree leaves.
The largest value of the Gap statistic corresponds to the best clustering.
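
The Gap computation can be sketched as follows, under two assumptions that follow Tibshirani et al.: D_k sums the pairwise dissimilarities over all ordered pairs of trees in cluster k, and V_K is the sum of D_k / (2 N_k). The closed-form expectation is the one given in Equation (9); the function name and the toy dissimilarity matrix are hypothetical.

```python
import math

def gap_statistic(dist, labels, n_leaves):
    """Gap_N(K) = E*_N{log(V_K)} - log(V_K) for one clustering solution."""
    clusters = sorted(set(labels))
    v = 0.0
    for k in clusters:
        idx = [i for i, lab in enumerate(labels) if lab == k]
        d_k = sum(dist[i][j] for i in idx for j in idx)  # all ordered pairs
        v += d_k / (2 * len(idx))
    big_n, big_k = len(labels), len(clusters)
    # Closed-form expectation of log(V_K), with n the number of tree leaves.
    expected = math.log(big_n * n_leaves / 12) - (2 / n_leaves) * math.log(big_k)
    return expected - math.log(v)  # requires v > 0

# Hypothetical dissimilarities between four trees forming two tight pairs.
dist = [
    [0, 1, 10, 11],
    [1, 0, 9, 10],
    [10, 9, 0, 1],
    [11, 10, 1, 0],
]
gap_k1 = gap_statistic(dist, [0, 0, 0, 0], n_leaves=6)
gap_k2 = gap_statistic(dist, [0, 0, 1, 1], n_leaves=6)
```

Unlike the Silhouette index, the statistic is defined for K = 1, so the one-cluster and two-cluster solutions can be compared directly; on this toy input the two-cluster solution has the larger Gap.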


3 Results - A Biological Example 


To illustrate the methods described above, we used a dataset from Woese et al. [22]. 
The aminoacyl-tRNA synthetases (aaRSs) are enzymes that attach the appropriate 
amino acid onto its cognate transfer RNA. The structure-function aspect of aaRSs 
has long attracted the attention of biologists [22, 6]. Moreover, the relationship of 
aaRSs to the genetic code is observed from the evolutionary view (the central role 
played by the aaRSs in translation would suggest that their histories and that of the 
genetic code are somehow intertwined [22]). The novel domain additions to aaRSs 
genes play an important role in the inference of the ToL. 

We encoded 20 original aminoacyl-tRNA synthetase trees from Woese et al. [22] 
in Newick format and then split some of them into sub-trees to account for cases 
where the same species appeared more than once in the original tree. Our approach 
cannot handle data that includes multiple instances of the same species in the input 
trees. Thus, 36 aaRS trees with different numbers of leaves (including 72 species
in total) were used as input of our algorithm (their Newick strings are available at:
https://github.com/tahiri-lab/PhyloClust). Our approach was applied
with the α parameter set to 1.

First, we implemented our new approach with the Gap statistic cluster validity 
index which suggested the presence of 7 clusters of trees in the data, thus suggesting a 
heterogeneous scenario of their evolution. Then, we conducted the computation using 
the SH cluster validity index and obtained 2 clusters of trees each of which could 
be represented by its own supertree. The first cluster obtained using SH included 19 
trees for a total of 56 organisms, whereas the second cluster included 17 trees for 
a total of 61 organisms. The supertrees (see Figure 1) for the two obtained clusters 
of trees were inferred using the CLANN program [5]. Further, we decided to infer 
the most common horizontal gene transfers which characterized the evolution of 
gene trees included in the two obtained tree clusters. The method of [3], reconciling 
the species and gene phylogenies to infer transfers, was used for this purpose. The
species phylogenies followed the NCBI taxonomic classification. These phylogenies
were not fully resolved (the species phylogeny in Figure 1a contains 9 internal nodes
with a degree higher than 3 and the species phylogeny in Figure 1b contains 10
internal nodes with a degree higher than 3).

Fig. 1 Nonbinary species trees corresponding to the NCBI taxonomic classification, represented
with (a) 56 species for cluster 1. The 4 HGTs (indicated by arrows) were found by the SH index for
the first cluster; (b) 61 species with α equal to 1 for cluster 2. The 2 HGTs (indicated by arrows) were
found by the SH index with α equal to 1 for the second cluster. We applied the Most Similar Supertree
Method (dfit) [5] implemented in the CLANN software with the mrp criterion. This criterion is a
matrix representation employing the parsimony criterion.

We used the version of the HGT (Horizontal Gene Transfer) algorithm available 
on the T-Rex web site [2] to identify the scenarios of HGT events that reconcile the 
species tree and each of the supertrees. We chose the same root for the species
trees and the supertrees: the root that separates Bacteria from the clade of Eukaryota
and Archaea.

For the first cluster composed of 56 species, we obtained 40 transfers with 22
regular and 18 trivial HGTs. Trivial HGTs are necessary to transform a non-binary
tree into a binary tree. We removed the trivial HGTs and selected among the regular
HGTs. The non-trivial HGTs with low representation are most likely due to tree
reconstruction artefacts. In Figure 1a, we illustrate only those HGTs that are most
represented in the dataset.

We followed the same procedure for the second cluster composed of 61 species 
and obtained 42 transfers with 28 regular and 14 trivial HGTs that are not represented 
here. We selected only the most popular HGTs in the dataset. All other transfers are 
represented in Figure 1b. 

The transfer linking P. horikoshii to the clade of spirochetes (i.e. B. burgdorferi
and T. pallidum) was found by [3, 14]. The transfers of P. horikoshii to P. aerophilum
were also found by [14]. These results confirm the HGTs previously reported in [3, 14].


4 Discussion


Many research groups are estimating trees containing several thousands to hundreds 
of thousands of species, toward the eventual goal of the estimation of the Tree of Life, 
containing perhaps several million leaves. These phylogenetic estimations present 
enormous computational challenges, and current computational methods are likely to 
fail to run even with datasets on the low end of this range. One approach to estimate a 
large species tree is to use phylogenetic estimation methods (such as maximum like- 
lihood) on a supermatrix produced by concatenating multiple sequence alignments 
for a collection of markers; however, the most accurate of these phylogenetic estima- 
tion methods are extremely computationally intensive for datasets with more than 
a few thousand sequences. Supertree methods, which assemble phylogenetic trees 
from a collection of trees on subsets of the taxa, are important tools for phylogeny 
estimation where phylogenetic analyses based upon maximum likelihood (ML) are 
infeasible. 

In this article, we described a new algorithm for partitioning a set of phylogenetic
trees into several clusters in order to infer multiple supertrees, for which the input trees
have different, but mutually overlapping sets of leaves. We presented new formulas 
that allow the use of the popular Silhouette and Gap statistic cluster validity indices 
along with the Robinson and Foulds topological distance in the framework of tree 
clustering based on the popular k-means algorithm. The new algorithm can be used 
to address a number of important issues in bioinformatics, such as the identification 
of genes having similar evolutionary histories, e.g. those that underwent the same 
horizontal gene transfers or those that were affected by the same ancient duplication 
events. It can also be used for the inference of multiple subtrees of the Tree of Life. In 
order to compute the Robinson and Foulds topological distance between such pairs 
of trees, we can first reduce them to a common set of leaves. After this reduction, 
the Robinson and Foulds distance is normalized by its maximum value, which is 
equal to 2n — 6 for two binary trees with n leaves. Overall, the good performance 
achieved by the new algorithm in both clustering quality and running time makes it 
well suited for analyzing large genomic and phylogenetic datasets. A C++ program,
called PhyloClust (Phylogenetic trees Clustering), implementing the discussed tree
partitioning algorithm is freely available at https://github.com/tahiri-lab/PhyloClust.


On Parsimonious Modelling via Matrix-variate t
Mixtures


Salvatore D. Tomarchio 


Abstract Mixture models for matrix-variate data have become increasingly
popular in recent years. One issue of these models is the potentially high
number of parameters. To address this concern, parsimonious mixtures of matrix- 
variate normal distributions have been recently introduced in the literature. However, 
when data contains groups of observations with longer-than-normal tails or atypi- 
cal observations, the use of the matrix-variate normal distribution for the mixture 
components may affect the fitting of the resulting model. Therefore, we consider a 
more robust approach based on the matrix-variate t distribution for modeling the 
mixture components. To introduce parsimony, we use the eigen-decomposition of the 
components scale matrices and we allow the degrees of freedom to be equal across 
groups. This produces a family of 196 parsimonious matrix-variate t mixture mod- 
els. Parameter estimation is obtained by using an AECM algorithm. The use of our 
parsimonious models is illustrated via a real data application, where parsimonious 
matrix-variate normal mixtures are also fitted for comparison purposes. 


Keywords: matrix-variate, mixture models, clustering, parsimonious models 


1 Introduction 


The matrix-variate model-based clustering literature has been expanding
over the last few years, as confirmed by the high number of contributions using finite
mixture models for modeling matrix-variate data [1, 2, 3, 4, 5, 6, 7, 8]. This
kind of data is arranged in three-dimensional arrays, and depending on the entities
indexed in each of the three layers, different data examples might be considered
[9]. In many of these applications, we observe a p × r matrix for each statistical
observation. Thus, from a model-based clustering perspective, the challenge is to
suitably cluster realizations coming from random matrices.

One problem of matrix-variate mixture models is the potentially high number of 
parameters. To cope with this issue, [5] have recently proposed a family of parsimo- 
nious mixtures based on the matrix-variate normal (MVN) distribution. Nevertheless, 
for many datasets, the tails of the MVN distribution are often shorter than required. 
This has several consequences on parameter estimation as well as in the proper data 
classification [4, 7]. Therefore, in this paper we relax the normality assumption of the 
mixture components by using (in a parsimonious setting) the matrix-variate t (MVT) 
distribution. The MVT distribution has been used within the finite mixture model 
paradigm by [10] in an unconstrained framework. Here, to introduce parsimony in 
this model, (i) we use the eigen-decomposition of the two scale matrices of each 
mixture component and (ii) we allow the degrees of freedom to be tied across the 
groups. This produces the family of 196 parsimonious MVT mixture
models (MVT-Ms) discussed in Section 2. Parameter estimation is implemented by
using an alternating expectation-conditional maximization (AECM) algorithm [12].
In Section 3, our parsimonious MVT-Ms, along with parsimonious
MVN mixture models (MVN-Ms) for comparison purposes, are fitted to a Swedish
municipalities expenditure dataset. The differences in terms of fitting between the two
families of models are illustrated. The estimated parameters and the data partition
of the overall best fitting model are also discussed. Finally, some conclusions are
drawn in Section 4.


2 Methodology 
2.1 Parsimonious Mixtures of Matrix-variate t Distributions 


The probability density function (pdf) of a p × r random matrix X coming from
a finite mixture model is

\[
f_{MIXT}(X; \Omega) = \sum_{g=1}^{G} \pi_g f(X; \Theta_g) , \tag{1}
\]

where $\pi_g$ is the gth mixing proportion, such that $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$, $f(X; \Theta_g)$
is the gth component pdf with parameter $\Theta_g$, and $\Omega$ contains all of the parameters
of the mixture. In this paper, for the gth component of model (1), we adopt the MVT
distribution having pdf

\[
f_{MVT}(X; \Theta_g) = \frac{|\Sigma_g|^{-r/2} \, |\Psi_g|^{-p/2} \, \Gamma\left(\frac{pr + \nu_g}{2}\right)}{(\pi \nu_g)^{pr/2} \, \Gamma\left(\frac{\nu_g}{2}\right)} \left[ 1 + \frac{\delta_g(X; M_g, \Sigma_g, \Psi_g)}{\nu_g} \right]^{-\frac{pr + \nu_g}{2}} , \tag{2}
\]

where $\delta_g(X; M_g, \Sigma_g, \Psi_g) = \mathrm{tr}[\Sigma_g^{-1} (X - M_g) \Psi_g^{-1} (X - M_g)']$, $M_g$ is the p × r
component mean matrix, $\Sigma_g$ is the p × p component row scale matrix, $\Psi_g$ is the r × r
component column scale matrix and $\nu_g > 0$ is the component degree of freedom.
It is interesting to recall that the pdf in (2) can be hierarchically obtained via the
matrix-variate normal scale mixture model when the mixing random variable W follows
a gamma distribution with shape and rate parameters set to $\nu_g/2$ [10]. Specifically, a
hierarchical representation of the MVT distribution can be given as follows:

1. $W \sim \mathcal{G}(\nu_g/2, \nu_g/2)$,

2. $X \mid W = w \sim \mathcal{N}(M_g, \Sigma_g/w, \Psi_g)$,

where $\mathcal{G}(\cdot)$ is a gamma distribution and $\mathcal{N}(\cdot)$ denotes the MVN distribution. This
representation will be convenient for parameter estimation presented in Section 2.2.
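
The two-step representation can be turned into a sampler, sketched below with the standard library only. It assumes the gamma distribution is parameterised by shape and rate (so the scale passed to `gammavariate` is 2/ν) and builds the conditional matrix-variate normal draw from Cholesky factors; the helper names and the example parameter values are hypothetical, and the example column scale matrix is not normalised to unit determinant.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sample_mvt(mean, sigma, psi, nu, rng):
    """Draw one p x r matrix from the hierarchical representation."""
    p, r = len(mean), len(mean[0])
    # Step 1: W ~ Gamma(shape nu/2, rate nu/2); gammavariate expects a scale.
    w = rng.gammavariate(nu / 2, 2 / nu)
    # Step 2: X | W = w ~ N(M, Sigma / w, Psi), built as M + A Z B' / sqrt(w),
    # where Sigma = A A', Psi = B B' and Z has independent N(0, 1) entries.
    z = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(p)]
    a = cholesky(sigma)
    b_t = [list(col) for col in zip(*cholesky(psi))]
    noise = matmul(matmul(a, z), b_t)
    return [[mean[i][j] + noise[i][j] / math.sqrt(w) for j in range(r)] for i in range(p)]

rng = random.Random(2022)
mean = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
sigma = [[4.0, 2.0], [2.0, 3.0]]
psi = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
x = sample_mvt(mean, sigma, psi, nu=3.75, rng=rng)
```

Small values of w inflate the scale of the conditional normal draw, which is how the heavier tails arise.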

As discussed in Section 1, the mixture model in (1) may be characterized by a
potentially high number of parameters. To address this concern, we firstly use the
eigen-decomposition of the component scale matrices $\Sigma_g$ and $\Psi_g$. In detail, we
recall that a generic q × q scale matrix $\Phi_g$ can be decomposed as [11]

\[
\Phi_g = \lambda_g \Gamma_g \Delta_g \Gamma_g' , \tag{3}
\]

where $\lambda_g = |\Phi_g|^{1/q}$, $\Gamma_g$ is a q × q orthogonal matrix whose columns are the
normalized eigenvectors of $\Phi_g$, and $\Delta_g$ is the scaled ($|\Delta_g| = 1$) diagonal matrix of
the eigenvalues of $\Phi_g$. By constraining the three components in (3), the following
family of 14 parsimonious structures is obtained: EII, VII, EEI, VEI, EVI, VVI,
EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV, where "E" stands for equal, "V"
means varying and "I" denotes the identity matrix.

If we apply the decomposition in (3) to $\Sigma_g$ and $\Psi_g$, we obtain 14 × 14 = 196
parsimonious structures. However, to solve a well-known identifiability issue related
to the scale matrices of matrix-variate distributions [1, 3, 5], we impose the restriction
$|\Psi_g| = 1$, which makes the parameter $\lambda_g$ unnecessary, and reduces the number of
parsimonious structures related to $\Psi_g$ from 14 to 7: II, EI, VI, EE, VE, EV, VV.
Thus, we have 14 × 7 = 98 parsimonious structures for the component scale matrices.

To further increase the parsimony of model (1), we also consider the option
of constraining the component degrees of freedom $\nu_g$. The nomenclature used is
the same as that adopted for the scale matrices. This option, combined with that
discussed above for the scale matrices, allows us to produce a total of 98 × 2 = 196
parsimonious MVT-Ms.
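
The count of 196 models can be checked by enumerating the combinations, as in the sketch below. The labels assume the usual naming of the 14 row structures (EII to VVV) and of the 7 column structures (II to VV), and the hyphenated model names follow the pattern of the "VVE-EE-V" label used in Section 3; the variable names are illustrative.

```python
from itertools import product

# 14 structures for the row scale matrices (volume, shape, orientation).
ROW = ["EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE",
       "VEE", "EVE", "VVE", "EEV", "VEV", "EVV", "VVV"]
# 7 structures for the column scale matrices (shape, orientation only,
# since the unit-determinant restriction removes the volume parameter).
COL = ["II", "EI", "VI", "EE", "VE", "EV", "VV"]
# Degrees of freedom either equal (E) or varying (V) across groups.
DF = ["E", "V"]

models = ["-".join(parts) for parts in product(ROW, COL, DF)]
n_scale_structures = len(ROW) * len(COL)
```

For example, "VVE-EE-V" combines a VVE row structure, an EE column structure and group-specific degrees of freedom.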


2.2 An AECM Algorithm for Parameter Estimation 


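To estimate the parameters of our family of mixture models, we implement an AECM
algorithm. By using the hierarchical representation of Section 2.1, our complete data
are $S_c = \{X_i, z_i, w_i\}_{i=1}^{N}$, where $z_i = (z_{i1}, \ldots, z_{iG})'$, such that $z_{ig} = 1$ if observation
i belongs to group g and $z_{ig} = 0$ otherwise, and $w_i$ is the realization of W. Therefore,
the complete-data log-likelihood can be written as

\[
\ell_c(\Omega; S_c) = \ell_{1c}(\boldsymbol{\pi}; S_c) + \ell_{2c}(\Xi; S_c) + \ell_{3c}(\vartheta; S_c) , \tag{4}
\]

where

\[
\begin{aligned}
\ell_{1c}(\boldsymbol{\pi}; S_c) &= \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \ln(\pi_g) , \\
\ell_{2c}(\Xi; S_c) &= \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left[ -\frac{pr}{2} \ln(2\pi) + \frac{pr}{2} \ln(w_{ig}) - \frac{r}{2} \ln|\Sigma_g| - \frac{p}{2} \ln|\Psi_g| - \frac{w_{ig} \, \delta_g(X_i; M_g, \Sigma_g, \Psi_g)}{2} \right] , \\
\ell_{3c}(\vartheta; S_c) &= \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left[ \frac{\nu_g}{2} \ln\left(\frac{\nu_g}{2}\right) - \ln\Gamma\left(\frac{\nu_g}{2}\right) + \left(\frac{\nu_g}{2} - 1\right) \ln(w_{ig}) - \frac{\nu_g}{2} w_{ig} \right] ,
\end{aligned}
\tag{5}
\]

with $\boldsymbol{\pi} = \{\pi_g\}_{g=1}^{G}$, $\Xi = \{M_g, \Sigma_g, \Psi_g\}_{g=1}^{G}$, and $\vartheta = \{\nu_g\}_{g=1}^{G}$.
Our AECM algorithm then proceeds as follows (notice that the parameters
marked with one dot are the updates of the previous iteration, while those marked
with two dots are the updates at the current iteration):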


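E-step At the E-step we have to compute the following quantities

\[
\ddot{z}_{ig} = \frac{\dot{\pi}_g f_{MVT}(X_i; \dot{\Theta}_g)}{\sum_{h=1}^{G} \dot{\pi}_h f_{MVT}(X_i; \dot{\Theta}_h)} \quad \text{and} \quad \ddot{w}_{ig} = \frac{pr + \dot{\nu}_g}{\dot{\nu}_g + \dot{\delta}_g(X_i; \dot{M}_g, \dot{\Sigma}_g, \dot{\Psi}_g)} . \tag{6}
\]

There is no need to compute the expected value of $\ln(W_{ig})$, given that we do not
use this quantity to update $\nu_g$.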
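
A minimal sketch of the two E-step quantities follows, assuming the component log-densities ln f_MVT(X_i; Θ_g) and the distances δ_g(X_i) have already been evaluated at the previous iterates. The function name, the argument layout and the numbers in the example are hypothetical.

```python
import math

def e_step(log_dens, pis, deltas, nus, p, r):
    """Posterior membership probabilities z_ig and weights w_ig.

    log_dens[i][g]: log component density of observation i in group g
    deltas[i][g]:   distance delta_g(X_i) under the current parameters
    """
    z, w = [], []
    for ld_i, delta_i in zip(log_dens, deltas):
        logs = [math.log(pi) + ld for pi, ld in zip(pis, ld_i)]
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        z.append([v / total for v in weights])
        # w_ig = (pr + nu_g) / (nu_g + delta_ig)
        w.append([(p * r + nu) / (nu + d) for nu, d in zip(nus, delta_i)])
    return z, w

# One observation, two groups, with p = 3 and r = 9 as in the application.
z, w = e_step(
    log_dens=[[-1.0, -1.0]],
    pis=[0.25, 0.75],
    deltas=[[27.0, 100.0]],
    nus=[3.75, 3.75],
    p=3,
    r=9,
)
```

The weight equals 1 when δ equals pr and falls below 1 for larger distances, so observations far from a component mean are down-weighted in the subsequent updates.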
CM-step 1 At the first CM-step, we have the following updates

\[
\ddot{\pi}_g = \frac{\sum_{i=1}^{N} \ddot{z}_{ig}}{N} \quad \text{and} \quad \ddot{M}_g = \frac{\sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig} X_i}{\sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig}} .
\]

Because of space constraints, we cannot report here the updates of each
parsimonious structure related to $\Sigma_g$ and $\Psi_g$. However, they can be obtained by
generalizing the results in [5]. The only differences consist in the updates of the
row and column scatter matrices of the gth component, that are here defined as

\[
\sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig} (X_i - \ddot{M}_g) \dot{\Psi}_g^{-1} (X_i - \ddot{M}_g)' \quad \text{and} \quad \sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig} (X_i - \ddot{M}_g)' \ddot{\Sigma}_g^{-1} (X_i - \ddot{M}_g) ,
\]

respectively.

CM-step 2 At the second CM-step, we firstly define the "partial" complete-data
log-likelihood function according to the following specification

\[
\ell_{pc}(\Omega; S_{pc}) = \ell_{1c}(\boldsymbol{\pi}; S_{pc}) + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \ln f_{MVT}(X_i; \Theta_g) , \tag{7}
\]

where "partial" refers to the fact that the complete data are now defined as
$S_{pc} = \{X_i, z_i\}_{i=1}^{N}$. Then, $\ddot{\nu}_g$ is determined by maximizing

\[
\sum_{i=1}^{N} \ddot{z}_{ig} \ln f_{MVT}(X_i; \Theta_g) \quad \text{or} \quad \sum_{i=1}^{N} \sum_{g=1}^{G} \ddot{z}_{ig} \ln f_{MVT}(X_i; \Theta_g) ,
\]

over $\nu_g \in (0, 100)$, depending on the parsimonious structure selected, i.e. V or
E, respectively. Notice that a higher upper bound could also have been selected
for the maximization problem but, with the already chosen value, the differences
between an estimated MVT distribution and the nested MVN distribution would
be negligible. Furthermore, when a heavy-tailed distribution approaches
normality, the precision of the estimated tailedness parameters is unreliable [4].
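
The update of the degrees of freedom for one group (structure V) can be sketched as a one-dimensional search over (0, 100). The sketch assumes the standard matrix-variate t log-density, keeps only the terms that depend on ν, and uses a plain grid search as a stand-in for whatever numerical optimiser is actually used; the function names, the grid resolution and the example inputs are hypothetical.

```python
import math

def nu_objective(nu, deltas, z, p, r):
    """Weighted sum of the nu-dependent part of ln f_MVT for one group.

    The terms in |Sigma| and |Psi| do not involve nu and are dropped.
    """
    d = p * r
    const = (math.lgamma((d + nu) / 2) - math.lgamma(nu / 2)
             - (d / 2) * math.log(math.pi * nu))
    return sum(
        zi * (const - ((d + nu) / 2) * math.log1p(di / nu))
        for zi, di in zip(z, deltas)
    )

def update_nu(deltas, z, p, r, steps=1000, upper=100.0):
    """Grid search for the maximiser over the open interval (0, upper)."""
    grid = [upper * k / steps for k in range(1, steps)]
    return max(grid, key=lambda nu: nu_objective(nu, deltas, z, p, r))

# Hypothetical distances for five observations fully assigned to the group.
deltas = [20.0, 25.0, 30.0, 35.0, 400.0]
z = [1.0, 1.0, 1.0, 1.0, 1.0]
nu_hat = update_nu(deltas, z, p=3, r=9)
```

For structure E, the same search would be run on the sum of the objectives over all groups, with a single ν shared by the components.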


3 Real Data Application 


Here, we analyze the Municipalities dataset contained in the AER package [13] 
for the R statistical software. It consists of expenditure information for N = 265 
Swedish municipalities over r = 9 years (1979-1987). For each municipality, we 
measure the following p = 3 variables: (i) total expenditures, (ii) total own-source 
revenues and (iii) intergovernmental grants received. 

We fitted parsimonious MVT-Ms and MVN-Ms for G ∈ {1, 2, 3, 4, 5} to the data,
and for each family of models the Bayesian information criterion (BIC) [14] is used
to select the best fitting model. According to our results, the best among the
MVN-Ms has a BIC of -82362.61, a VVV-EE structure and G = 4 groups, while
the best among the MVT-Ms has a BIC of -82701.59, a VVE-EE-V structure and G = 3
groups. Thus, the overall best fitting model is that selected for the MVT-Ms. The
MVN-Ms seem to overfit the data, given that an additional group is detected. This is not an
unusual behavior, given that the tails of normal mixture models cannot adequately
accommodate deviations from normality, and additional groups are consequently
found in the data [4, 7, 15]. In any case, the best fitting models of the two families agree
in finding varying volumes and shapes in the component row scale matrices and
equal shapes and orientations in the component column scale matrices.

Figure 1 illustrates the parallel coordinate plots of the data partition detected by
the VVE-EE-V MVT-Ms. The dashed lines correspond to the estimated mean for
that variable, across time, in that group. We notice that the first group contains
municipalities having, on average, slightly higher expenditures, an intermediate
level of revenues and higher levels of intergovernmental grants than the other two
groups. Furthermore, it seems to cluster several outlying observations, as confirmed
by the estimated degree of freedom $\nu_1 = 3.75$, which implies quite heavy tails
for this mixture component. The second group shows the lowest average levels of
expenditures and revenues, but a similar amount of received grants to that of the
third group. Interestingly, this group does not present many outlying observations,
as also supported by the estimated degree of freedom $\nu_2 = 10.95$. Lastly, the third
group has the highest levels of revenues but, as already said, it is similar to the other
two groups in the other variables. Also in this case, we have a moderately heavy tail
behavior given that the estimated degree of freedom is $\nu_3 = 6.05$.

Fig. 1 Parallel coordinate plots of the data partition obtained by the VVE-EE-V MVT-Ms. The
dashed lines correspond to the estimated means.

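To evaluate the correlations of the variables with each other and over time, for the
three groups, we now report the correlation matrices $R_{(\cdot)}$ related to the covariance
matrices associated with $\Sigma_g$ and $\Psi_g$:

\[
R_{\Sigma_1} = \begin{bmatrix} 1.00 & 0.48 & 0.14 \\ 0.48 & 1.00 & -0.06 \\ 0.14 & -0.06 & 1.00 \end{bmatrix}, \quad
R_{\Sigma_2} = \begin{bmatrix} 1.00 & 0.55 & 0.18 \\ 0.55 & 1.00 & -0.07 \\ 0.18 & -0.07 & 1.00 \end{bmatrix}, \quad
R_{\Sigma_3} = \begin{bmatrix} 1.00 & 0.73 & 0.22 \\ 0.73 & 1.00 & -0.02 \\ 0.22 & -0.02 & 1.00 \end{bmatrix},
\]

\[
R_{\Psi_1} = R_{\Psi_2} = R_{\Psi_3} = \begin{bmatrix}
1.00 & 0.80 & 0.72 & 0.67 & 0.65 & 0.59 & 0.58 & 0.55 & 0.52 \\
0.80 & 1.00 & 0.79 & 0.73 & 0.69 & 0.62 & 0.62 & 0.57 & 0.54 \\
0.72 & 0.79 & 1.00 & 0.80 & 0.73 & 0.69 & 0.66 & 0.63 & 0.60 \\
0.67 & 0.73 & 0.80 & 1.00 & 0.79 & 0.73 & 0.71 & 0.67 & 0.64 \\
0.65 & 0.69 & 0.73 & 0.79 & 1.00 & 0.83 & 0.80 & 0.73 & 0.71 \\
0.59 & 0.62 & 0.69 & 0.73 & 0.83 & 1.00 & 0.80 & 0.76 & 0.73 \\
0.58 & 0.62 & 0.66 & 0.71 & 0.80 & 0.80 & 1.00 & 0.81 & 0.78 \\
0.55 & 0.57 & 0.63 & 0.67 & 0.73 & 0.76 & 0.81 & 1.00 & 0.79 \\
0.52 & 0.54 & 0.60 & 0.64 & 0.71 & 0.73 & 0.78 & 0.79 & 1.00
\end{bmatrix} .
\]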

When $R_{\Sigma_1}$, $R_{\Sigma_2}$ and $R_{\Sigma_3}$ are considered, we notice that, as it might be reasonable to
expect, the correlations between total expenditures and total own-source revenues
or intergovernmental grants received are positive, although they increase as we move
from the first to the third group. Conversely, there exists a slightly negative correlation
between total own-source revenues and intergovernmental grants received. However,
there are no great differences among the groups in this case. As concerns
$R_{\Psi_1}$, $R_{\Psi_2}$ and $R_{\Psi_3}$, we observe that the correlation among the columns, i.e. between
time points, decreases as the temporal distance increases. Furthermore, considering
the dimensionality of these column matrices, the benefit of an EE parsimonious
structure with respect to a fully unconstrained model, in terms of the number of
parameters to be estimated, is readily apparent.

Finally, we analyze the uncertainty of the detected classification. This can be
computed, for each observation, by subtracting the probability $z_{ig}$ of the most likely
group from 1 [16]. The lower the uncertainty is, the stronger the assignment becomes.
The quantiles of the obtained uncertainties can be used to measure the quality of
the classification. In this regard, we noticed that 75% of the observations have an
uncertainty equal to or lower than 0.05. However, we observed a maximum value of
0.50. This happens when groups intersect, since uncertain classifications are expected
in the overlapping regions [17]. Relatedly, more detailed information can be gained
by looking at the uncertainty plot illustrated in Figure 2, which reports the (sorted)
uncertainty values of all the municipalities. We see that the municipalities clustered
in the first group, excluding a couple of cases, have practically null uncertainties.
This applies to a lesser extent to the municipalities in the other two groups, given
the slightly higher number of exceptions. For example, there are 15 observations
(approximately 5% of the total sample size) that have uncertainty values greater than
0.3. However, and as said above, this is due to the closeness between the groups,
which can be confirmed by looking at the parallel plots in Figure 1.

Fig. 2 Uncertainty plot for the Municipalities dataset.
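
The uncertainty measure and its quantiles can be sketched as follows. The posterior probabilities below are made-up values for illustration, not results from the Municipalities data, and the quantile helper uses linear interpolation between order statistics, which is one of several common conventions.

```python
def uncertainties(z):
    """Uncertainty of each observation: 1 minus its largest posterior probability."""
    return [1.0 - max(row) for row in z]

def quantile(values, q):
    """Empirical quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

# Hypothetical posterior probabilities for four observations and G = 3 groups.
z = [
    [0.98, 0.01, 0.01],
    [0.50, 0.50, 0.00],
    [0.10, 0.70, 0.20],
    [0.00, 0.00, 1.00],
]
u = uncertainties(z)
median_u = quantile(u, 0.5)
```

With G groups the uncertainty cannot exceed 1 − 1/G, and a value of 0.50 corresponds to an observation whose most likely group has posterior probability 0.50.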


4 Conclusions 


One serious concern of matrix-variate mixture models is the potentially high number
of parameters. Furthermore, many real datasets require models having
heavier-than-normal tails. To address both aspects, in this paper a family of 196 parsimonious
mixture models, based on the matrix-variate t distribution, is introduced. The
eigen-decomposition of the component scale matrices, as well as constraints on the
component degrees of freedom, are used to attain parsimony. An AECM algorithm for
parameter estimation has been presented. Our family of models has been fitted to a
real dataset along with parsimonious mixtures of matrix-variate normal distributions.
The results show that our models provide the best fit, and they illustrate the
overfitting tendency of matrix-variate normal mixtures. Lastly, the estimated parameters
and data partition for the best of our models have been reported and discussed.


Evolution of Media Coverage on Climate Change
and Environmental Awareness: an Analysis of 
Tweets from UK and US Newspapers 


Gianpaolo Zammarchi, Maurizio Romano, and Claudio Conversano 


Abstract Climate change represents one of the biggest challenges of our time. 
Newspapers might play an important role in raising awareness on this problem and 
its consequences. We collected all tweets posted by six UK and US newspapers in 
the last decade to assess whether 1) the space given to this topic has grown, 2) any 
breakpoint can be identified in the time series of tweets on climate change, and 3) any 
main topic can be identified in these tweets. Overall, the number of tweets posted on 
climate change increased for all newspapers during the last decade. Although a sharp 
decrease in 2020 was observed due to the pandemic, for most newspapers climate 
change coverage started to rise again in 2021. While different breakpoints were 
observed, for most newspapers 2019 was identified as a key year, which is plausible 
based on the coverage received by activities organized by the Fridays for Future 
movement. Finally, using different topic modeling approaches, we observed that, 
while unsupervised models partly capture relevant topics for climate change, such 
as the ones related to politics, consequences for health or pollution, semi-supervised 
models might be of help to reach higher informativeness of words assigned to the 
topics. 


1 Introduction 


Climate change is one of the biggest challenges for our society. Its consequences, 
which include, among others, melting glaciers, warming oceans, rising sea levels, 
and shifting weather or rainfall patterns, are already impacting our health and 
imposing costs on society. Without drastic action aimed at reducing or preventing 
human-induced emissions of greenhouse gases, these consequences are expected 
to intensify in the coming years. Despite its global and severe impacts, individuals 
may perceive climate change as an abstract problem [1]. It is also well known that 
the level of information plays a crucial role in awareness of a topic (e.g. 
healthy food [2] and smoking [3]). Media represent a crucial source of information 
and can exert substantial effects on public opinion, thus helping to raise awareness 
of climate change. For instance, media can explain the consequences of climate change 
and portray actions that governments, communities and individuals can 
take. For this reason, it is important to distinguish themes that might have gained 
popularity from those that may have seen a decrease in interest. Nowadays, social 
media have become a reliable and popular source of information for people 
all around the world. Twitter is one of the most popular microblogging services and 
is used by many traditional newspapers on a daily basis. While we can hypothesize 
that media coverage of climate change might have risen in the last few years, 
due for instance to international climate strike movements, the recent emergence of 
the coronavirus disease 2019 (COVID-19) pandemic might have led to a decrease in 
attention to other relevant topics. 

The aims of this work were to: (1) assess trends in media coverage of climate change 
using tweets posted by major international newspapers based in the United Kingdom 
(UK) and the United States (US), and (2) identify the main topics discussed in these 
tweets using topic modeling. 


2 Dataset and Methods 


We downloaded all tweets posted from January 1st, 2012 to December 31st, 2021 
by the official Twitter accounts of six widely known newspapers based in the UK (The 
Guardian, The Independent and The Mirror) or the US (The New York Times, The 
Washington Post and The Wall Street Journal), leading to a collection of 3,275,499 tweets. 
Next, we determined which tweets were related to climate change and environmental 
awareness based on the presence of at least one of the following keywords: "climate 
change", "sustainability", "earth day", "plastic free", "global warming", "pollution", 
"environmentally friendly" or "renewable energy". We plotted the number of tweets 
on climate change posted by each newspaper during each year using R v. 4.1.2 [4]. 
We analyzed the association between the number of tweets on climate change and 
the total number of tweets posted by each newspaper using Spearman's correlation 
analysis. For each year and for each newspaper, we computed and plotted the 
differences in the number of posted tweets compared to the previous year, for both (a) 
tweets related to climate change and (b) all tweets. Finally, we used the changepoint 
R package [5] to conduct an analysis aimed at identifying structural breaks, i.e. 
unexpected changes in a time series. In many applications, it is reasonable to believe 
that there might be m breakpoints (especially if some exogenous event occurs) at which a 
shift in mean value is observed. The changepoint package estimates the breakpoints 
using several penalty criteria such as the Bayesian Information Criterion (BIC) or the 
Akaike Information Criterion (AIC). We estimated the breakpoints using the Binary 
Segmentation (BinSeg) method [6] implemented in the package. 
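
To illustrate the idea behind binary segmentation for shifts in mean, the following is a minimal Python sketch. It is not the R code used in this study and does not reproduce the changepoint package: it assumes a squared-error cost (a Gaussian mean-shift model with fixed variance), and the function names, the penalty value and the yearly counts are all hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the segment mean (assumed cost)."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(series, start, end):
    """Return (gain, index) of the best single split of series[start:end]."""
    base = sse(series[start:end])
    best_gain, best_k = 0.0, None
    for k in range(start + 1, end):
        gain = base - sse(series[start:k]) - sse(series[k:end])
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_gain, best_k


def binary_segmentation(series, penalty, max_breaks=4):
    """Greedily split the segment whose best split reduces the cost most.

    Stops when no split improves the cost by more than `penalty`
    or when `max_breaks` breakpoints have been found.
    """
    segments = [(0, len(series))]
    breaks = []
    while len(breaks) < max_breaks:
        candidates = []
        for s, e in segments:
            if e - s >= 2:
                gain, k = best_split(series, s, e)
                if k is not None:
                    candidates.append((gain, k, s, e))
        if not candidates:
            break
        gain, k, s, e = max(candidates, key=lambda c: c[0])
        if gain <= penalty:
            break
        breaks.append(k)
        segments.remove((s, e))
        segments.extend([(s, k), (k, e)])
    return sorted(breaks)


# Hypothetical yearly counts with two shifts in mean (not data from this study).
counts = [10, 12, 11, 13, 40, 42, 41, 39, 15, 14]
breakpoints = binary_segmentation(counts, penalty=50.0)
```

Each returned index marks the first observation of a new segment. In practice the penalty would be derived from a criterion such as BIC or AIC rather than fixed by hand.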

Lastly, we used tweets posted by The Guardian to perform topic modeling, a 
method for classification of text into topics. Preprocessing (including lemmatization, 
removal of stopwords and creation of the document term matrix) was conducted with 
tm [7] and quanteda [8] in R. We used two different approaches: 1) Latent Dirichlet 
Allocation (LDA) implemented in the textmineR R package [9]; and 2) Correlation 
Explanation (CorEx), an alternative approach to LDA that allows both unsupervised 
and semi-supervised topic modeling [10]. 


3 Results 
3.1 Analysis of Tweet Trends and Breakpoints 


Among 3,275,499 collected tweets, we identified 11,155 tweets related to climate 
change and environmental awareness. Figure 1A shows the number of tweets on 
climate change posted by each of the analyzed newspapers from 2012 to 2021, while 
Figure 1B shows the total number of tweets posted by each newspaper. 


Fig. 1 Number of tweets on climate change (A) or total number of tweets (B) posted by the six 
newspapers from 2012 to 2021. 


For the majority of newspapers, the number of tweets on climate change increased 
from 2014 to 2019, decreased sharply in 2020, coinciding with the emergence 
of the COVID-19 pandemic, and rose again in 2021. On the other hand, the 
number of tweets on climate change posted by The Guardian peaked in 
2015 and subsequently decreased. However, it must be noted that The Guardian is 
also the newspaper that showed the most pronounced decrease in the overall number 
of tweets. 


Fig. 2 Year-over-year percentage changes of overall tweets and tweets on climate change. A: The 
Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, 
F: The Wall Street Journal. 


The number of tweets on climate change was significantly positively correlated 
with the overall number of tweets posted from 2012 to 2021 for four newspapers (The 
Guardian, Spearman’s rho = 0.95, p < 0.001; The Mirror, Spearman’s rho = 0.95, p 
< 0.001; The Independent, Spearman’s rho = 0.76, p = 0.016; The Washington Post, 
Spearman’s rho = 0.70, p = 0.031) but not for The New York Times (Spearman’s 
rho = 0.18, p = 0.63) or The Wall Street Journal (Spearman's rho = 0.49, p = 0.15). 
Year-over-year percentage changes among either tweets related to climate change or 
all posted tweets can be observed in Figure 2. 
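
As an illustration of the statistic reported above, Spearman's rho is the Pearson correlation computed on the ranks of the two series. The sketch below is a minimal Python version under that definition, not the R code used in the study. The function names and the yearly counts are hypothetical, and the p-values are not computed.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical yearly counts for one newspaper (not data from this study).
climate_tweets = [120, 150, 90, 200, 170]
all_tweets = [30000, 28000, 31000, 45000, 40000]
rho = spearman_rho(climate_tweets, all_tweets)
```

Because only ranks enter the computation, the coefficient is insensitive to the very different scales of the two series.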

Figure 2 shows great variability in the number of tweets posted over 
the years, both for the total number of tweets and for the number of 
tweets on climate change. While the analysis aimed at identifying structural changes 
in the time series of tweets on climate change identified three or four 
breakpoints for all newspapers, wide variability was observed regarding the specific 
year in which these structural changes were identified (Figure 3). Despite this 
variability, Figure 3 shows that, even if a common breakpoint cannot be identified, 
2019 was a key year for five out of six newspapers (all except The Independent). 


Fig. 3 Structural changes in the time series of tweets related to climate change. A: The Guardian, 
B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F: The Wall 
Street Journal. The red line represents the years between two breakpoints. 


3.2 Topic Modeling 


Finally, we used topic modeling to identify and analyze the main 
topics discussed by newspapers in their tweets. Due to space limitations, we focus 
only on The Guardian, since this newspaper showed a trend in contrast with the 
others. The data comprise 2,916 tweets posted by The Guardian, analyzed using LDA 
and CorEx. For LDA, a range of 5-20 unsupervised topics was tested, with the most 
interpretable results obtained with 10 topics (Table 1). The topic coherence ranged 
from 0.01 to 0.34 (mean: 0.13). For each topic, bi-gram topic labels were assigned 
with the labeling algorithm implemented in textmineR. We can observe that topics 
are related to politics or leaders (Topics 3, 7 and 10), environmental scientists or 
climate journalists (Topics 1 and 5), energy sources (Topics 4 and 8) and effects 
of climate change (Topics 2, 6 and 9). The intertopic distance map obtained with 
LDAvis is shown in Figure 4. The area of each circle is proportional to the relative 
prevalence of that topic in the corpus, while inter-topic distances are computed based 
on Jensen-Shannon divergence. 
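
For illustration, the Jensen-Shannon divergence between two topics can be computed from their word distributions as the average Kullback-Leibler divergence of each distribution from their mixture. The following minimal Python sketch assumes base-2 logarithms, so the value lies between 0 and 1. The two distributions are hypothetical, and the sketch covers only the divergence itself, not the projection used to draw the map.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical word distributions of two topics over a four-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.10, 0.10, 0.40, 0.40]
distance = jensen_shannon(topic_a, topic_b)
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and remains finite when a word has zero probability in one of the two topics.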


Table 1 Top terms for the ten topics identified with LDA. 


Topic 1          Topic 2        Topic 3       Topic 4           Topic 5
dana nuccitelli  air pollution  barack obama  renewable energy  john abraham
dana             pollution      fight         energy            john
dana nuccitelli  air            obama         renewable         trump
nuccitelli       air pollution  trump         renewable energy  australia
live             study          plan          uk                tackle
trump            finds          battle        sustainability    abraham

Topic 6          Topic 7        Topic 8       Topic 9           Topic 10
air pollution    donald trump   fossil fuel   extreme weather   pope francis
pollution        trump          report        world             pollution
air              schoolstrike   fossil        paris             study
air pollution    school         ipcc          leaders           tackling
uk               great          warns         talks             pope
tackle           donald         stop          deal              scientists


Fig. 4 Intertopic distance map. 


Finally, we conducted a semi-supervised topic modeling analysis based on 
anchored words using CorEx. When anchoring a word to a topic, CorEx maximizes the 
mutual information between that word and the topic, thus guiding the topic model 
towards specific subsets of words. A model with 5 topics and three anchored words 
for each topic (Table 2) showed a total correlation (i.e. the measure maximized by 
CorEx when constructing the topic model) of 4.36. This value was higher than 
that observed with an unsupervised CorEx analysis with the same number of 
topics (total correlation = 0.97, topics not shown due to space limits). Topics related 
to politics (Topic 3) and science (Topic 5) were found to be the most informative in 
our dataset based on the total correlation metric. 
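
As background on the metric, the total correlation of a set of discrete variables is the sum of their individual entropies minus their joint entropy; it is zero when the variables are independent. The minimal Python sketch below computes this general quantity from observed samples. It is not the CorEx objective itself, which concerns the part of the total correlation among words that is explained by the latent topics. The binary word-presence data and the use of base-2 logarithms are assumptions of this sketch.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def total_correlation(rows):
    """Sum of the marginal entropies of the columns minus the joint entropy of the rows."""
    columns = list(zip(*rows))
    joint = [tuple(row) for row in rows]
    return sum(entropy(col) for col in columns) - entropy(joint)


# Hypothetical word-presence indicators (one row per tweet, one column per word).
dependent = [(0, 0), (1, 1), (0, 0), (1, 1)]    # the two words always co-occur
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]  # the two words are unrelated
tc_dependent = total_correlation(dependent)
tc_independent = total_correlation(independent)
```

In this toy example the co-occurring words share one bit of information, whereas the unrelated words share none.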


Table 2 Topics with anchored words and examples of tweets. 


Topic 1 
Topic words: school, strike, march, schoolstrik, climatestrikeuk, ukschoolstrik, 
schoolstrikeclim, climatemarch, arabia, saudi 
Example tweet: EPA wipes its climate change site day before march on Washington 

Topic 2 
Topic words: ocean, ice, environment, john, dana, nuccitelli, air, abraham, sea, reed 
Example tweet: Chasing Ice filmmakers plumb the 'bottomless' depths of climate change - 
new clip from @GuardianEco 

Topic 3 
Topic words: trump, obama, lead, donald, barack, ivanka, brighton, repli, administr, pick 
Example tweet: Trump administration pollution rule strikes final blow against environment 

Topic 4 
Topic words: plastic, fuel, oil, fossil, compani, pictur, wast, big, bay, photo 
Example tweet: Engaging with oil companies on climate change is futile 

Topic 5 
Topic words: studi, scientist, research, find, link, say, show, death, prematur, speci 
Example tweet: Microplastic pollution revealed 'absolutely everywhere' by new research 

The anchored words are reported in bold. 


4 Discussion 


The present study aimed to evaluate how much space some of the most relevant British 
and American newspapers have given to the topic of climate change on their 
Twitter accounts in the last decade. Apart from The Guardian, which showed a decreasing 
trend in the number of tweets related to climate change, all the other newspapers 
showed an overall growing trend, except during 2020. During this year, the number 
of tweets related to climate change declined for all six newspapers. This was most 
probably due to the COVID-19 outbreak, which was massively covered by all media. 
The breakpoints in Figure 3 indicate that 2019 was 
a relevant year for climate change. This is plausible considering that, starting from 
the end of 2018, the strikes launched by the Fridays for Future movement to raise 
awareness of climate change gained high media coverage. 

Our topic modeling analysis showed that the main topics defined using 
unsupervised models such as LDA are mostly related to politics, environmental scientists, 
energy sources and effects of climate change. While unsupervised models capture 
relevant topics, we found that a semi-supervised CorEx model reached 
a higher total correlation, which is a measure of the informativeness of the topics, 
than an unsupervised model with the same number of topics. 

In future work, we plan to extend our analyses to newspapers from other 
countries. We believe our work is useful for gaining more knowledge and awareness 
of the climate change topic and of how much space relevant newspapers have 
given to this issue on social media. Increasing the knowledge about the nature of the 
topics covered by newspapers will lay the basis for future studies aimed at evaluating 
public awareness of this highly relevant challenge. 